Math Entertainment Games

Alternate Baseball Universes

Posted by kdawson
from the say-it-ain't-so-joe dept.
Jamie found a NYTimes op-ed by a grad student and a professor from Cornell, outlining some research they did into alternate baseball universes. The goal was to find out just how unlikely Joe DiMaggio's 56-game hitting streak, set in the 1941 season, really was. No one since has even come close to that record. The math guys ran simulations of the entire history of baseball from 1871 on, 10,000 of them. For each simulation they put each player up to the plate for each at-bat in each game in each year, just like it happened, and they rolled the dice on him, based on his actual hitting stats for that season. (Their algorithm sounds far simpler than whatever the Strat-O-Matic guys use.) The result: a streak like Joltin' Joe's is not merely likely, it's basically a sure thing. Every alternate universe produced a streak of 39 games or better; one reached 109 games. Joe DiMaggio was not the likeliest player in the history of the game to accomplish the record, not by a long shot.
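
The dice-rolling procedure described in the summary can be sketched for a single player. This is only a rough illustration and not the researchers' actual code: the batting average, at-bats per game, and season length below are made-up round numbers, and a real run would use each player's actual per-season statistics.

```python
import random

def longest_streak(games, at_bats_per_game, batting_avg, rng):
    """Longest run of consecutive games with at least one hit."""
    best = current = 0
    for _ in range(games):
        # Roll the dice once per at-bat; the game counts if any at-bat is a hit.
        got_hit = any(rng.random() < batting_avg for _ in range(at_bats_per_game))
        current = current + 1 if got_hit else 0
        best = max(best, current)
    return best

def simulate_seasons(n_seasons, games=150, at_bats_per_game=4,
                     batting_avg=0.350, seed=1941):
    """Replay one hypothetical season n_seasons times (all parameters are assumptions)."""
    rng = random.Random(seed)
    return [longest_streak(games, at_bats_per_game, batting_avg, rng)
            for _ in range(n_seasons)]

streaks = simulate_seasons(1_000)
print("median longest streak:", sorted(streaks)[len(streaks) // 2])
print("longest streak seen:  ", max(streaks))
```

Repeating this over every player in every season, and keeping only the single longest streak per simulated history, gives the kind of distribution the op-ed reports.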

  • by Martin Blank (154261) on Sunday March 30, 2008 @05:19PM (#22915094) Journal
    No, they reran 1871-2005 through the simulator a total of 10,000 times. This is clear not only from the statement that says as much ("Using a comprehensive collection of baseball statistics from 1871 to 2005, we simulated the entire history of baseball 10,000 times in a computer"), but from the mention that the record was set most often in 1894.
  • by kevinatilusa (620125) <kcostell.gmail@com> on Sunday March 30, 2008 @05:25PM (#22915156)
    From the descriptions I've seen of their research, it seems that they're treating all games identically for the purpose of determining a typical season's behavior. While this may be necessary to make the computation tractable, it's not realistic, and it introduces a sizable bias towards long hitting streaks.

    In reality, a league is typically very imbalanced from team to team and from pitcher to pitcher (probably even more so in the game of the early 20th century than now). It's easier to get hits off of two successive average pitchers than it is to get hits both off of a very good and a very bad pitcher. For example (to oversimplify a good deal):

    Say the league is split 50/50 between "good" pitchers (pitchers you'll get a hit off of in 50% of games) and "bad" pitchers (pitchers you'll get a hit off of in 80% of games). In a typical 20-game stretch, you'll encounter 10 good pitchers and 10 bad ones, and your odds of getting a hit in all 20 games would be (0.50)^10 * (0.80)^10, about 1/9537.

    Under their analysis as I understand it, they'd replace all the pitchers by mediocre pitchers who you'd get a hit off of 65% of the time, and your odds would be (0.65)^20, about 1/5517.

    This one assumption almost doubled your chances of getting a hit in all 20 games.

    There are other biases as well going the other way (ignoring the effect of hitting slumps, for example), but this one jumped out at me.
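
The arithmetic in the example above can be checked in a few lines. The 50%, 80%, and 65% figures are the comment's own illustrative numbers, not real pitching data.

```python
# Probability of a hit in all 20 games against 10 "good" and 10 "bad" pitchers.
p_mixed = 0.50**10 * 0.80**10

# Same 20 games if every pitcher is replaced by the 65% "average" pitcher.
p_averaged = 0.65**20

print(f"mixed pitchers:    about 1 in {1 / p_mixed:.0f}")
print(f"averaged pitchers: about 1 in {1 / p_averaged:.0f}")
print(f"averaging inflates the odds by a factor of {p_averaged / p_mixed:.2f}")
```

The gap comes from the fact that a product of probabilities with a fixed average is largest when they are all equal, so smoothing out the pitchers can only help the streak.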
  • by rufusdufus (450462) on Sunday March 30, 2008 @05:38PM (#22915248)
    Our simulations did something very much like this, except instead of a coin, we used random numbers generated by a computer.
    It is not mathematically sound to do statistics with a random number generator. Computers do not actually generate random numbers, but instead, they can only make pseudo-random numbers that have a certain distribution.
    Any 'simulation' done in this way will always have a bias.
    In order to get correct statistics, you must actually compute the statistics.
  • by kevinatilusa (620125) <kcostell.gmail@com> on Sunday March 30, 2008 @05:47PM (#22915326)

    It is not mathematically sound to do statistics with a random number generator. Computers do not actually generate random numbers, but instead, they can only make pseudo-random numbers that have a certain distribution. Any 'simulation' done in this way will always have a bias. In order to get correct statistics, you must actually compute the statistics.
    Sure, the proper way to put it mathematically would have been "we did a Monte-Carlo based simulation of the probability distribution of the longest hitting streak under our model due to the intractability of direct computation", but this is an editorial in the New York Times, not a mathematical journal! As a side note, just because a computation is performed on a set of pseudorandom numbers does not mean it is biased...usually the whole point of pseudorandomness is that the discrepancy between computations involving them and identical computations involving true random numbers will typically be quite small.
  • by Anonymous Coward on Sunday March 30, 2008 @05:53PM (#22915380)
    I think it's safe to say that almost all statistical work uses a random number generator. Computers *do* have the capability to produce true random numbers, as shown with /dev/random [wikipedia.org], which relies on an entropy pool and is suitable for cryptographic key generation.
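
For anyone who wants to poke at the operating system's entropy source from code, here is a small sketch using Python's standard library. Exactly which kernel interface these calls end up using depends on the platform and Python version, so treat the link to /dev/random as an assumption rather than a guarantee.

```python
import os
import random

# 16 bytes from the operating system's random source.
raw = os.urandom(16)
print(raw.hex())

# random.SystemRandom draws from the same OS source instead of the
# default deterministic generator, so it cannot be seeded or replayed.
sys_rng = random.SystemRandom()
roll = sys_rng.randint(1, 6)
print("die roll:", roll)
```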
  • by Frequency Domain (601421) on Sunday March 30, 2008 @06:00PM (#22915432)
    No bashing, it's not a bad question. The answer is because it still qualifies as a "rare event". The thing that's kind of counter-intuitive, but easy to demonstrate, is that having a particular rare event happen is rare, but having some rare event happen is common.

    A good illustration of this is the so-called "birthday paradox", which asks what's the probability of having duplicate birthdays in a group of n people (whose birthdays are independent of each other). Think of adding the people to the room one by one. The first person doesn't have any chance of having a duplicate birthday, because there's nobody else in the room. The second person has a 1/365 chance of duplicating and a 364/365 chance of missing the first one. Let's follow up on the misses; they're easier to work with. In general, if we've got k people in the room without a duplicate, that means they've used up k of the 365 days in the year, and the next person we introduce to the room has to miss all of those days to avoid a duplication. So the probability of everybody missing everybody else, by the time we get up to n people in the room, is (365/365)*(364/365)*(363/365)*...*((365-n+1)/365), which starts diving towards zero really fast. The probability of having one or more duplicates is 1 - P(no duplicates), which correspondingly climbs to one really fast. If you write a short program to do the exact calculations, you'll find that by the time you have 23 people in the room the probability is greater than 0.5 of having a duplicate, and by the time you get 57 people it's greater than 0.99!

    If you pick one particular person and ask what's the probability of duplicating that birthday it remains quite small. That's the difference between having a particular rare event rather than having some rare event. For a large enough group, some pair of people will almost surely share a birthday but the odds of it being you (or any other designated person) remain quite small.

    Just to preserve my computing geek cred, this is why you need collision resolution for hashing algorithms. You don't know which entries will share hash values, but collisions are almost certain to happen by the time you've loaded 3 * sqrt(Hash Table Capacity) values, e.g., if your hash table has capacity 10000 you will almost surely see a duplicate within the first 300 entries.
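
The "short program to do the exact calculations" might look something like this. It assumes, as the comment does, that every value is equally likely and that draws are independent; the function name is just a label chosen for this sketch.

```python
def p_duplicate(n, days=365):
    """Exact probability that n independent uniform draws from `days` values contain a repeat."""
    p_none = 1.0
    for k in range(n):
        # The (k+1)-th draw must miss the k values already used.
        p_none *= (days - k) / days
    return 1.0 - p_none

# Birthday problem: people in a room.
for n in (22, 23, 56, 57):
    print(f"{n} people: {p_duplicate(n):.4f}")

# Same calculation for a hash table with 10000 slots and 300 entries,
# i.e. 3 * sqrt(capacity) as in the rule of thumb above.
print(f"300 entries, 10000 slots: {p_duplicate(300, days=10_000):.4f}")
```

Swapping `days` for the table capacity is all it takes to reuse the birthday calculation for hash collisions, assuming the hash function spreads keys uniformly.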

  • This seems relevant:

    http://abcnews.go.com/Technology/WhosCounting/story?id=3694104&page=1 [go.com]

    Disclaimer: I'm not an American, so I know next to nothing about baseball - and care less!
  • by Anonymous Coward on Sunday March 30, 2008 @06:07PM (#22915470)
    No, it won't always have a "bias". Bias is a technical term here, and in fact there is not likely to be bias. Bias is where the long-run average value of simulated variables is not equal to the actual average value of the thing you're simulating. For example, rolling a chipped die to draw numbers uniformly from 1 to 6 will probably cause this.

    The problem with pseudorandom number generation tends to be dependence between samples (barring a more serious bug, which has happened... but this is always a problem, and there can also be bugs in the rest of the code anyway). Now this correlation is a problem for cryptography maybe, since there is intense interest in every bit of entropy in a very short signal, and a lot of clever guys hacking at it.

    However, in statistics you basically just use the random numbers as "fuel" for a sequence of very stupid computations (more or less, glorified averages and averages of squares, &c.). The functions used in statistics are just too stupid to find out that the numbers have inter-dependence, so they tend to give the same results for pseudo-random numbers as for truly random numbers. This is thanks to a lot of hard work from many fields to improve pseudorandom number generators.

    In fact, and as a tangent, theoretical computer scientists tend to believe that any randomness in an algorithm can be replaced by deterministic functions! (although they don't believe this as widely as they believe P!=NP). Since we can consider any statistical procedure an algorithm, the effect (at least philosophically) that this would have on many applied fields is mind-boggling. I would love to see a proof and some general techniques for this "derandomization" - if there were one, we could finally absolve ourselves of our state of sin [wikiquote.org]. (It would also imho inform the "free will" debate a bit.)
  • by Vellmont (569020) on Sunday March 30, 2008 @07:17PM (#22915958)

    Computers do not actually generate random numbers

    That'll be a surprise to the multiple true random number generators built into most operating systems. There are many sources of random data in a computer: timing between keystrokes, timing of mouse movements, network latency between packets, and of course hardware random number generators that use thermal noise as their source.

    So to put it mildly, computers can, and DO generate truly random numbers that are completely unpredictable and free from bias.

    (Oh, BTW, to do a Monte-Carlo simulation (which the referenced article is) you actually don't need true random numbers, you only need a pseudo-random source that's free from bias. Those pseudo-random sources do exist, and aren't even that difficult to code.)
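
One way to see this for yourself is to run a Monte-Carlo estimate with an ordinary pseudo-random generator on a problem whose exact answer is known. The sketch below uses the 23-person birthday problem discussed elsewhere in this thread; the seed and trial count are arbitrary choices, and Python's default generator (a Mersenne Twister) is simply what the standard library provides, not what the researchers used.

```python
import random

def exact_p_duplicate(n, days=365):
    """Exact probability of at least one shared birthday among n people."""
    p_none = 1.0
    for k in range(n):
        p_none *= (days - k) / days
    return 1.0 - p_none

def estimate_p_duplicate(n, trials, rng, days=365):
    """Monte-Carlo estimate of the same probability from pseudo-random draws."""
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        if len(set(birthdays)) < n:  # fewer distinct days than people means a repeat
            hits += 1
    return hits / trials

rng = random.Random(12345)  # fixed seed so the run is repeatable
estimate = estimate_p_duplicate(23, 20_000, rng)
exact = exact_p_duplicate(23)
print(f"pseudo-random estimate: {estimate:.4f}")
print(f"exact value:            {exact:.4f}")
```

With 20,000 trials the sampling error alone is a few tenths of a percentage point, so any remaining gap says more about the trial count than about the generator.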
  • Comparison of Sports (Score:2, Informative)

    by buildguy (965589) on Sunday March 30, 2008 @07:39PM (#22916104)
    Interesting comparison made on this page, but I'm not sure if it is accurate. http://en.wikipedia.org/wiki/Don_Bradman#World_sport_context [wikipedia.org]
  • by Vellmont (569020) on Sunday March 30, 2008 @08:35PM (#22916492)

    Unless you are dealing with quantum effects, you are not dealing with something truly random.


    From wikipedia on "electronic (thermal) noise":

    In any electronic circuit, there exist random variations in current or voltage caused by the random movement of the electrons carrying the current as they are jolted around by thermal energy.

    Is that quantum mechanical enough for you?

    As for network latency between packets, while it may not be random on a quantum-mechanical level, it's still unpredictable unless you can get on the same LAN segment as the target computer. The keyboard timings are taken on a small enough time scale that they're quite unpredictable, and not related to the typist.
  • Re:Nerves (Score:2, Informative)

    by cleatsupkeep (1132585) on Sunday March 30, 2008 @08:58PM (#22916628) Homepage
    I remember reading an article saying that "clutch performers" don't really exist - and that the reason we believe they do is because of the same biases that make us cling to our beliefs - taking note of something when it fits your belief and tossing it away when it disagrees.

    Wikipedia to the rescue (http://en.wikipedia.org/wiki/Clutch_(sports) [wikipedia.org])

    Some sports analysts have presented evidence that while individual plays and moments may resonate as "clutch" because of their importance, there is no such thing as "clutch ability" or an inherently clutch player. One example of such an argument is presented in the 2006 book Baseball Between the Numbers published by Baseball Prospectus, which compiles evidence that no baseball players are demonstrably consistently clutch over the course of a career, and that the numbers of allegedly clutch players in clutch situations are in fact no different from players reputed to be "chokers."[1]

    The rest of the page does well with some very good examples from baseball - including Derek Jeter and Reggie Jackson (both Yankees - maybe the author was a Red Sox fan).
  • Modern Intel motherboards (i810 forward) and AMD motherboards (768 forward) have a hardware RNG (Random Number Generator) that IIRC is based on diode noise. That's straight up quantum randomness, and most modern Linux distros automatically detect and use it if available.
