Alternate Baseball Universes
Jamie found a NYTimes op-ed by a grad student and a professor from Cornell, outlining some research they did into alternate baseball universes. The goal was to find out just how unlikely Joe DiMaggio's 56-game hitting streak, set in the 1941 season, really was. No one since has even come close to that record. The math guys ran 10,000 simulations of the entire history of baseball from 1885 on. For each simulation they put each player up to the plate for each at-bat in each game in each year, just like it happened, and they rolled the dice on him, based on his actual hitting stats for that season. (Their algorithm sounds far simpler than whatever the Strat-O-Matic guys use.) The result: Joltin' Joe's record is not merely likely, it's basically a sure thing. Every alternate universe produced a streak of 39 games or better; one reached 109 games. Joe DiMaggio was not the likeliest player in the history of the game to accomplish the record, not by a long shot.
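For anyone who wants to play with the idea, here is a minimal sketch of that kind of dice-rolling simulation for a single player-season. It is not the Cornell code: it assumes every at-bat is an independent coin flip at the player's season average, and the function names and the sample numbers (a .357 average, 4 at-bats per game, 139 games) are illustrative assumptions only.

```python
import random

def longest_streak(results):
    """Length of the longest run of True values in an iterable of booleans."""
    best = current = 0
    for got_hit in results:
        current = current + 1 if got_hit else 0
        best = max(best, current)
    return best

def simulate_season(avg, at_bats_per_game, games, rng):
    """Longest hitting streak in one simulated season.

    Assumption: each at-bat is independent, so the chance of at least
    one hit in a game is 1 - (1 - avg) ** at_bats_per_game.
    """
    p_game = 1.0 - (1.0 - avg) ** at_bats_per_game
    return longest_streak(rng.random() < p_game for _ in range(games))

def simulate_many(avg, at_bats_per_game, games, seasons, seed=0):
    """Repeat the season many times with a seeded pseudo-random generator."""
    rng = random.Random(seed)
    return [simulate_season(avg, at_bats_per_game, games, rng)
            for _ in range(seasons)]

# Illustrative inputs only, not the study's data.
streaks = simulate_many(avg=0.357, at_bats_per_game=4, games=139, seasons=2000)
print("longest streak in any simulated season:", max(streaks))
print("share of seasons with a 30+ game streak:",
      sum(s >= 30 for s in streaks) / len(streaks))
```

The real study ran every player in every season, which is what makes a long streak by somebody so much more likely than a long streak by one named player.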
Too many assumptions? (Score:5, Informative)
In reality, a league is typically very imbalanced from team to team and from pitcher to pitcher (probably even more so in the game of the early 20th century than now). It's easier to get hits off of two successive average pitchers than it is to get hits both off of a very good and a very bad pitcher. For example (to oversimplify a good deal):
Say the league is split 50/50 between "good" pitchers (pitchers you'll get a hit off of 50% of games) and "bad" pitchers (pitchers you'll get a hit off of 80% of games). In a typical 20 game stretch, you'll encounter 10 good pitchers and 10 bad ones, and your odds of getting a hit in all 20 games would be (0.50)^10(0.80)^10, about 1/9537.
Under their analysis as I understand it, they'd replace all the pitchers by mediocre pitchers who you'd get a hit off of 65% of the time, and your odds would be (0.65)^20, about 1/5517.
This one assumption almost doubled your chances of getting a hit in all 20 games.
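The arithmetic above is easy to check. A short sketch, using only the 50%/80%/65% figures from the example (the helper name is made up for illustration):

```python
def streak_probability(per_game_probs):
    """Probability of getting a hit in every game, assuming independence."""
    p = 1.0
    for q in per_game_probs:
        p *= q
    return p

# 10 games against "good" pitchers and 10 against "bad" ones.
mixed = streak_probability([0.50] * 10 + [0.80] * 10)
# The same 20 games with every pitcher replaced by the 65% average.
averaged = streak_probability([0.65] * 20)

print(f"mixed pitchers:    about 1 in {1 / mixed:.0f}")
print(f"averaged pitchers: about 1 in {1 / averaged:.0f}")
print(f"averaging inflates the odds by a factor of {averaged / mixed:.2f}")
```

The direction of the bias is not an accident of these numbers: a product of probabilities with a fixed average is largest when they are all equal (the AM-GM inequality), so averaging out the pitchers can only raise the computed streak odds.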
There are other biases as well going the other way (ignoring the effect of hitting slumps, for example), but this one jumped out at me.
You can't do statistics with a random # generator (Score:2, Informative)
It is not mathematically sound to do statistics with a random number generator. Computers do not actually generate random numbers, but instead, they can only make pseudo-random numbers that have a certain distribution.
Any 'simulation' done in this way will always have a bias.
In order to get correct statistics, you must actually compute the statistics.
Re:If its so likely, they why hasn't it happened? (Score:5, Informative)
A good illustration of this is the so-called "birthday paradox", which asks what's the probability of having duplicate birthdays in a group of n people (whose birthdays are independent of each other). Think of adding the people to the room one by one. The first person doesn't have any chance of having a duplicate birthday, because there's nobody else in the room. The second person has 1/365 chances of duplicating, 364/365 of missing the first one. Let's follow up on the misses, they're easier to work with. In general, if we've got k people in the room without a duplicate, that means they've used up k of the 365 days in the year, and the next person we introduce to the room has to miss all of those days to avoid a duplication. So the probability of everybody missing everybody else, by the time we get up to n people in the room, is (365/365)*(364/365)*(363/365)*...*((365-n+1)/365), which starts diving towards zero really fast. The probability of having one or more duplicates is 1 - P(no duplicates), which correspondingly climbs to one really fast. If you write a short program to do the exact calculations, you'll find that by the time you have 23 people in the room the probability is greater than 0.5 of having a duplicate, and by the time you get 57 people it's greater than 0.99!
If you pick one particular person and ask what's the probability of duplicating that birthday it remains quite small. That's the difference between having a particular rare event rather than having some rare event. For a large enough group, some pair of people will almost surely share a birthday but the odds of it being you (or any other designated person) remain quite small.
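Here's roughly what that short program might look like. It's a sketch that assumes 365 equally likely birthdays and independence, as above; the function names are my own.

```python
def p_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    p_none = 1.0
    for k in range(n):
        # Person k+1 must miss the k days already taken.
        p_none *= (days - k) / days
    return 1.0 - p_none

def p_matches_designated(n, days=365):
    """Probability that someone among the other n-1 people shares
    one designated person's birthday."""
    return 1.0 - ((days - 1) / days) ** (n - 1)

for n in (10, 23, 57):
    print(f"n={n:3d}  some pair: {p_shared_birthday(n):.4f}  "
          f"designated person: {p_matches_designated(n):.4f}")
```

The two columns make the point: "some pair matches" races toward 1 while "this particular person matches" stays small.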
Just to preserve my computing geek cred, this is why you need collision resolution for hashing algorithms. You don't know which entries will share hash values, but collisions are almost certain to happen by the time you've loaded 3 * sqrt(Hash Table Capacity) values, e.g., if your hash table has capacity 10000 you will almost surely see a duplicate within the first 300 entries.
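The same calculation covers the hash table case. A sketch, assuming an idealized hash that spreads keys uniformly and independently over the slots (real hash functions only approximate this); the function name is hypothetical:

```python
import math

def p_collision(entries, capacity):
    """Exact probability of at least one collision when `entries` keys
    are hashed uniformly at random into `capacity` slots."""
    p_none = 1.0
    for k in range(entries):
        p_none *= (capacity - k) / capacity
    return 1.0 - p_none

capacity = 10_000
entries = 3 * math.isqrt(capacity)  # the 3 * sqrt(capacity) rule of thumb
print(f"{entries} entries in {capacity} slots: "
      f"P(collision) = {p_collision(entries, capacity):.4f}")
```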
Does Joe DiMaggio's Streak Deserve an Asterisk? (Score:5, Informative)
http://abcnews.go.com/Technology/WhosCounting/story?id=3694104&page=1 [go.com]
Disclaimer: I'm not an American, so I know next to nothing about baseball - and care less!
Re:You can't do statistics with a random # generat (Score:1, Informative)
The problem with pseudorandom number generation tends to be dependence between samples (barring a more serious bug, which has happened... but this is always a problem, and there can also be bugs in the rest of the code anyway). Now this correlation is a problem for cryptography maybe, since there is intense interest in every bit of entropy in a very short signal, and a lot of clever guys hacking at it.
However in statistics, you basically just use the random numbers as "fuel" for a sequence of very stupid computations (more or less, glorified averages and averages of squares, &c.). The functions used in statistics are just too stupid to find out that the numbers have inter-dependence, so they tend to give the same results for pseudo-random numbers as for truly random numbers. This is thanks to a lot of hard work from many fields, to improve pseudorandom number generators.
In fact, and as a tangent, theoretical computer scientists tend to believe that any randomness in an algorithm can be replaced by deterministic functions! (although they don't believe this as widely as they believe P!=NP). Since we can consider any statistical procedure an algorithm, the effect (at least philosophically) that this would have on many applied fields is mind-boggling. I would love to see a proof and some general techniques for this "derandomization" - if there were one, we could finally absolve ourselves of our state of sin [wikiquote.org]. (It would also imho inform the "free will" debate a bit.)
Re:You can't do statistics with a random # generat (Score:5, Informative)
Computers do not actually generate random numbers
That'll be a surprise to the multiple true random number generators built into most operating systems. There are many sources of random data in a computer: timing between keystrokes, timing of mouse movements, network latency between packets, and of course hardware random number generators that use thermal noise as their source.
So to put it mildly, computers can, and DO generate truly random numbers that are completely unpredictable and free from bias.
(Oh, BTW, to do a Monte-Carlo simulation (which the referenced article is) you actually don't need true random numbers, you only need a pseudo-random source that's free from bias. Those pseudo-random sources do exist, and aren't even that difficult to code.)
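A quick way to see this for yourself is to run a Monte Carlo estimate of something with a known exact answer. The sketch below borrows the birthday problem from elsewhere in this thread (23 people, 365 equally likely days) and runs the same estimator twice: once with Python's seeded pseudo-random generator, once with random.SystemRandom, which draws from the operating system's entropy source. The function names, seed, and trial count are arbitrary choices for illustration.

```python
import random

def exact_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    p_none = 1.0
    for k in range(n):
        p_none *= (days - k) / days
    return 1.0 - p_none

def estimate_shared_birthday(n, trials, rng, days=365):
    """Monte Carlo estimate of the same probability using generator `rng`."""
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            day = rng.randrange(days)
            if day in seen:
                hits += 1
                break
            seen.add(day)
    return hits / trials

exact = exact_shared_birthday(23)
pseudo = estimate_shared_birthday(23, 10_000, random.Random(2008))
system = estimate_shared_birthday(23, 10_000, random.SystemRandom())

print(f"exact:                   {exact:.4f}")
print(f"pseudo-random estimate:  {pseudo:.4f}")
print(f"OS-entropy estimate:     {system:.4f}")
```

Both estimates should land within ordinary sampling error of the exact value (around half a percentage point at 10,000 trials), which is the sense in which the simulation does not care where its randomness comes from.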
Re:You can't do statistics with a random # generat (Score:4, Informative)
Unless you are dealing with quantum effects, you are not dealing with something truly random.
From wikipedia on "electronic (thermal) noise":
In any electronic circuit, there exist random variations in current or voltage caused by the random movement of the electrons carrying the current as they are jolted around by thermal energy.
Is that quantum mechanical enough for you?
As for network latency between packets, while it may not be random on a quantum-mechanical level, it's still unpredictable unless you can get on the same LAN segment as the target computer. The keyboard timings are taken on a small enough time scale that they're quite unpredictable, and not related to the typist.
Re:Nerves (Score:2, Informative)
Wikipedia to the rescue (http://en.wikipedia.org/wiki/Clutch_(sports) [wikipedia.org])
The rest of the page does well with some very good examples from baseball - including Derek Jeter and Reggie Jackson (both Yankees - maybe the author was a Red Sox fan).
Re:You can't do statistics with a random # generat (Score:3, Informative)
Modern Intel motherboards (i810 forward) and AMD motherboards (768 forward) have a hardware RNG (Random Number Generator) that IIRC is based on diode noise. That's straight up quantum randomness, and most modern Linux distros automatically detect and use it if available.