Alternate Baseball Universes
Posted by kdawson on Sunday March 30, @05:07PM
from the say-it-ain't-so-joe dept.
Jamie found a NYTimes op-ed by a grad student and a professor from Cornell, outlining some research they did into alternate baseball universes. The goal was to find out just how unlikely Joe DiMaggio's 56-game hitting streak, set in the 1941 season, really was. No one since has even come close to that record. The math guys ran 10,000 simulations of the entire history of baseball from 1885 on. For each simulation they put each player up to the plate for each at-bat in each game in each year, just like it happened, and they rolled the dice on him, based on his actual hitting stats for that season. (Their algorithm sounds far simpler than whatever the Strat-O-Matic guys use.) The result: a streak like Joltin' Joe's is not merely likely, it's basically a sure thing. Every alternate universe produced a streak of 39 games or better; one reached 109 games. And Joe DiMaggio was not the likeliest player in the history of the game to accomplish the record, not by a long shot.
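The dice-rolling idea in the summary can be sketched in a few lines of Python. This is only an illustration, not the researchers' actual algorithm: the function names, the fixed number of at-bats per game, and the sample batting average are all assumptions made up for the example.

```python
import random

def longest_streak(game_has_hit):
    """Length of the longest run of consecutive games with at least one hit."""
    best = current = 0
    for got_hit in game_has_hit:
        current = current + 1 if got_hit else 0
        best = max(best, current)
    return best

def simulate_season(avg, games, at_bats_per_game, rng):
    """Roll the dice on every at-bat of one season; return the longest hitting streak.

    Assumption: every at-bat is an independent coin flip with probability `avg`,
    and the player gets the same number of at-bats in every game.
    """
    results = []
    for _ in range(games):
        hit = any(rng.random() < avg for _ in range(at_bats_per_game))
        results.append(hit)
    return longest_streak(results)

def longest_streaks(avg, games, at_bats_per_game, runs, seed=0):
    """Replay the same season `runs` times and collect each run's longest streak."""
    rng = random.Random(seed)
    return [simulate_season(avg, games, at_bats_per_game, rng) for _ in range(runs)]

if __name__ == "__main__":
    # Hypothetical player: .350 average, 150 games, 4 at-bats per game.
    streaks = longest_streaks(0.350, 150, 4, runs=1000)
    print("longest streak over all runs:", max(streaks))
```

A full replay of baseball history would repeat this for every player in every season and keep the longest streak found anywhere in each simulated universe.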

If its so likely, they why hasn't it happened? (Score:5, Interesting)
I know the statisticians among you are going to bash me with a cluestick for such a naive question, but I'll ask anyway - if this event is so likely to occur, then why hasn't it happened again?
Re:If its so likely, they why hasn't it happened? (Score:4, Insightful)
Re:If its so likely, they why hasn't it happened? (Score:5, Insightful)
This sort of study is really more about curiosity; it doesn't deal with things like changes to the way the game is played. For instance, early on, and for quite a while later, it was common for a pitcher to pitch 9 innings every game, and in many cases to pitch both games of a doubleheader. That meant more opportunity for errors, and since batters get time to rest up, that style of play gave the batter a bit of an edge that doesn't exist today.
That also doesn't include the variety of pitching which players see today or the fact that a player might get to see 3 different pitchers in a single game.
Even the length of the season has an effect on how players play. None of those things are easily quantified, much less analyzed by statisticians.
Re:If its so likely, they why hasn't it happened? (Score:5, Insightful)
Re:If its so likely, they why hasn't it happened? (Score:5, Funny)
I think.
Re:If its so likely, they why hasn't it happened? (Score:5, Insightful)
Re:If its so likely, they why hasn't it happened? (Score:5, Insightful)
I wish my mod points hadn't just expired, because you just summed it up perfectly.
Really? For the purposes of this article, why?
It seems perfectly reasonable to me to take a set of data and try to model how likely a particular outcome is. That's a very valid question, and one that a statistical model can answer. The model may be flawed, need improvement, or whatever, but that doesn't mean the question is one that can't be answered by science.
If you invest in it, I guarantee a large return, because complex systems that rely heavily on myriad human variables are of course determined entirely by statistics.
This is simply an invalid analogy. The article isn't saying it can predict the future (or even the past!) based on a statistical model. All it's asking is "just how likely was it for DiMaggio to get his streak, given past performance?"
Re:If its so likely, they why hasn't it happened? (Score:5, Informative)
A good illustration of this is the so-called "birthday paradox", which asks what's the probability of having duplicate birthdays in a group of n people (whose birthdays are independent of each other). Think of adding the people to the room one by one. The first person doesn't have any chance of having a duplicate birthday, because there's nobody else in the room. The second person has 1/365 chances of duplicating, 364/365 of missing the first one. Let's follow up on the misses, they're easier to work with. In general, if we've got k people in the room without a duplicate, that means they've used up k of the 365 days in the year, and the next person we introduce to the room has to miss all of those days to avoid a duplication. So the probability of everybody missing everybody else, by the time we get up to n people in the room, is (365/365)*(364/365)*(363/365)*...*((365-n+1)/365), which starts diving towards zero really fast. The probability of having one or more duplicates is 1 - P(no duplicates), which correspondingly climbs to one really fast. If you write a short program to do the exact calculations, you'll find that by the time you have 23 people in the room the probability is greater than 0.5 of having a duplicate, and by the time you get 57 people it's greater than 0.99!
If you pick one particular person and ask what's the probability of duplicating that birthday it remains quite small. That's the difference between having a particular rare event rather than having some rare event. For a large enough group, some pair of people will almost surely share a birthday but the odds of it being you (or any other designated person) remain quite small.
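The "short program" for the exact calculation might look like the sketch below. The function names are made up for this example, and it assumes 365 equally likely, independent birthdays, as the comment does. The second function covers the "one particular person" case for comparison.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Multiplies the 'miss' probabilities (365/365)*(364/365)*... and
    subtracts the product from 1.
    """
    p_no_duplicate = 1.0
    for k in range(n):
        p_no_duplicate *= (days - k) / days
    return 1.0 - p_no_duplicate

def p_someone_matches_you(n, days=365):
    """Probability that at least one of the other n-1 people shares YOUR birthday."""
    return 1.0 - ((days - 1) / days) ** (n - 1)

if __name__ == "__main__":
    for n in (10, 23, 57):
        print(n, round(p_shared_birthday(n), 4), round(p_someone_matches_you(n), 4))
```

With 23 people the first probability is already past one half, while the chance that somebody matches one designated person stays below 6%.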
Just to preserve my computing geek cred, this is why you need collision resolution for hashing algorithms. You don't know which entries will share hash values, but collisions are almost certain to happen by the time you've loaded 3 * sqrt(Hash Table Capacity) values, e.g., if your hash table has capacity 10000 you will almost surely see a duplicate within the first 300 entries.
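The 3 * sqrt(capacity) rule of thumb can be checked with the same product formula. This is a sketch that assumes an idealized hash function spreading keys uniformly and independently over the buckets; the name p_collision is hypothetical.

```python
import math

def p_collision(entries, capacity):
    """Probability that at least two of `entries` keys land in the same bucket,
    assuming each key hashes uniformly at random into `capacity` buckets."""
    p_no_collision = 1.0
    for k in range(entries):
        p_no_collision *= (capacity - k) / capacity
    return 1.0 - p_no_collision

if __name__ == "__main__":
    capacity = 10000
    root = math.isqrt(capacity)  # 100 for a capacity of 10000
    for multiple in (1, 2, 3):
        n = multiple * root
        print(n, "entries:", round(p_collision(n, capacity), 4))
```

Real hash functions and real key sets are not perfectly uniform, so actual tables tend to collide at least this early.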
Re:If its so likely, they why hasn't it happened? (Score:4, Informative)
Re:If its so likely, they why hasn't it happened? (Score:5, Interesting)
Otherwise, buddy, you're way off base.
NL year-by-year stats. [baseball-reference.com]
Look at those ERAs pre-1920. Before 1920, the ERA on the NL never significantly exceeded 3.00. After 1920, it never dropped below 3.3 or so, with the exception of a 2.99 in 1968, after which MLB made changes to the rules, amongst them lowering the acceptable height of the pitcher's mound.
The time prior to 1920 was marked by pitchers such as Cy Young, Mordecai Brown, Walter Johnson, Ed Walsh, and Christy Mathewson. You've probably heard of most of them.
Here are the single-season MLB ERA leaders. [baseball-reference.com] Outside of Bob Gibson in the aforementioned 1968, you have to go all the way to Greg Maddux in 1994 at #48 all time to find a season after 1920 on the list. Barely 10 of the 100 lowest single-season ERAs in MLB history occurred after 1920. And that's only because Pedro Martinez in 2000 and Ron Guidry in 1978 tied with 9 others for #100 on the list. So only 8 of the best single-season ERAs happened after 1920.
You need to research "dead ball era", and the response by baseball to "Black Sox". (Hint: just like the response to the 1994 strike, it involves the ball...)
The fact that you got a +5 out of such a demonstrably incorrect post is a major indictment of the baseball knowledge of the Slashdot faithful.
Re:If its so likely, they why hasn't it happened? (Score:5, Insightful)
:::The early years tended to be batting competitions (in some ways like today's) rather than pitching competitions
::If by "early years", you mean 1920 and later, yeah.
:Otherwise, buddy, you're way off base.
The only one off base is yourself -- check your own link (baseball-reference.com is an amazing site and I recommend it to anyone) and pay extra attention to the 1890s. In the years immediately after the pitcher's mound was moved back to its current 60 feet 6 inches, offensive totals soared far beyond what we're used to seeing. Given that you're familiar with the lowering of the mound for 1969, I'm surprised that you're not familiar with when it was fixed at its current distance.
The article even mentions that the record was most likely to have been set in 1894, when the National League ERA was well over 5.00, and there were 11.6 hits per team per game, more than 20% more than we see now.
Look at those ERAs pre-1920. Before 1920, the ERA on the NL never significantly exceeded 3.00.
I'm looking at them. The "5.32" for 1894, which is somewhat more than three, is particularly striking.
After 1920, it never dropped below 3.3 or so, with the exception of a 2.99 in 1968, after which MLB made changes to the rules, amongst them lowering the acceptable height of the pitcher's mound.
...
You need to research "dead ball era", and the response by baseball to "Black Sox". (Hint: just like the response to the 1994 strike, it involves the ball...)
While he's doing this, perhaps you could research what came before the dead ball era: namely, the high-offense 1890s. Teams were taken off guard by the increase in the pitching distance and continued to play an 1880s game in a new environment. It took several seasons for adjustments, such as four-man pitching rotations and the occasional use of relief pitchers, to balance the sudden advantage that had been given to the batters. It is not surprising that 1894 would be the year in which a long hitting streak would have been most likely -- the single-season record for runs scored, 194 by Billy Hamilton, was set that year and still stands today.
The fact that you got a +5 out of such a demonstrably incorrect post is a major indictment of the baseball knowledge of the Slashdot faithful.
No, Martin is right -- the 1890s, while not as famous as Ruth and Gehrig's 1930s, were one of the most offensive eras in baseball. His simple analysis is much more forgivable than the insults you throw his way even while being completely ignorant of an entire decade of baseball history, the data from which are right on the web page you so callously direct him to visit.
Nerves (Score:5, Insightful)
Re:Nerves (Score:4, Interesting)
How to Make Baseball Even MORE Boring? (Score:4, Funny)
Re:How to Make Baseball Even MORE Boring? (Score:5, Funny)
Re:How to Make Baseball Even MORE Boring? (Score:5, Funny)
Talk about a great way to make an awkward social event even more awkward
i came here to make an insightful comment (Score:5, Funny)
there you will find that this comment contains something worthwhile reading. sorry
Changing game of baseball (Score:5, Interesting)
I think it would be more impressive to take a subset of the data, and compare from 1930 up until the present. Of course, there have been other major changes too: glove sizes, the introduction of the slider as a pitch, steroid use.
too simplistic (Score:5, Insightful)
The problem is this doesn't control for variance in the quality of pitching. The chance of going that many games without running into a hot pitcher isn't accounted for.
Imagine you average a 75% chance of getting a hit in any individual game. If you face three average pitchers, your chances are (.75)^3, but if you face a good pitcher, an average pitcher, and a bad pitcher, it might be (.5)(.75)(1.0), which gives a different probability despite the same average chance of a hit.
To be realistic, the calculation would need to account for the deviation from average in the ability of the pitchers (which would likely be higher 100 years ago because of fewer players and segregation, and now because of expansion, as compared to the 1950s).
What they don't report is how often there are long (but not record) streaks in their model, so there is no way of knowing how accurately it reproduces reality.
Too many assumptions? (Score:5, Informative)
In reality, a league is typically very imbalanced from team to team and from pitcher to pitcher (probably even more so in the game of the early 20th century than now). It's easier to get hits off of two successive average pitchers than it is to get hits both off of a very good and a very bad pitcher. For example (to oversimplify a good deal):
Say the league is split 50/50 between "good" pitchers (pitchers you'll get a hit off of 50% of games) and "bad" pitchers (pitchers you'll get a hit off of 80% of games). In a typical 20 game stretch, you'll encounter 10 good pitchers and 10 bad ones, and your odds of getting a hit in all 20 games would be (0.50)^10(0.80)^10, about 1/9537.
Under their analysis as I understand it, they'd replace all the pitchers with mediocre pitchers who you'd get a hit off of 65% of the time, and your odds would be (0.65)^20, about 1/5517.
This one assumption almost doubled your chances of getting a hit in all 20 games.
There are other biases as well going the other way (ignoring the effect of hitting slumps, for example), but this one jumped out at me.
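The arithmetic in this comment is easy to reproduce. The sketch below uses the comment's own numbers (0.50, 0.80, 0.65, and a 20-game stretch); the function name is hypothetical.

```python
def p_hit_every_game(per_game_probs):
    """Probability of getting a hit in every game, given each game's hit probability
    and assuming the games are independent."""
    p = 1.0
    for q in per_game_probs:
        p *= q
    return p

# 10 "good" pitchers (50% per game) and 10 "bad" pitchers (80% per game).
p_mixed = p_hit_every_game([0.50] * 10 + [0.80] * 10)

# The same 20 games against uniformly "mediocre" pitchers (65% per game).
p_uniform = p_hit_every_game([0.65] * 20)

if __name__ == "__main__":
    print("mixed pitchers:   about 1 in", round(1 / p_mixed))
    print("uniform pitchers: about 1 in", round(1 / p_uniform))
    print("ratio:", round(p_uniform / p_mixed, 2))
```

The ratio comes out a little above 1.7, which is the "almost doubled" figure in the comment: averaging away the spread in pitcher quality makes long streaks look easier than they are.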
Another unreported weird fact (Score:5, Funny)
Does Joe DiMaggio's Streak Deserve an Asterisk? (Score:5, Informative)
http://abcnews.go.com/Technology/WhosCounting/story?id=3694104&page=1 [go.com]
Disclaimer: I'm not an American, so I know next to nothing about baseball - and care less!
Re:Bogus (Score:4, Interesting)
What they are actually saying is that reality appears to follow a probability bell curve.
You could also say that, in 1,230,000 years of baseball games, we could be almost certain of a hitting streak longer than 56 games.
Re:You can't do statistics with a random # generat (Score:5, Informative)
Re:You can't do statistics with a random # generat (Score:5, Informative)
Computers do not actually generate random numbers
That'll be a surprise to the multiple true random number generators built into most operating systems. There are many sources of random data in a computer: timing between keystrokes, timing of mouse movements, network latency between packets, and of course hardware random number generators that use thermal noise as their source.
So to put it mildly, computers can, and DO generate truly random numbers that are completely unpredictable and free from bias.
(Oh, BTW, to do a Monte Carlo simulation (which is what the referenced article describes), you don't actually need true random numbers; you only need a pseudo-random source that's free from bias. Those pseudo-random sources do exist, and aren't even that difficult to code.)
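As a small illustration of that last point, here is a Monte Carlo estimate driven by Python's standard pseudo-random generator (a Mersenne Twister), checked against the exact answer. The quantity being estimated, the chance of at least one hit in a game, and all the numbers are assumptions chosen for the example, not figures from the article.

```python
import random

def estimate_hit_in_game(avg, at_bats, trials, seed):
    """Monte Carlo estimate of P(at least one hit in a game) using a seeded PRNG."""
    rng = random.Random(seed)  # deterministic: same seed, same sequence
    games_with_hit = 0
    for _ in range(trials):
        if any(rng.random() < avg for _ in range(at_bats)):
            games_with_hit += 1
    return games_with_hit / trials

def exact_hit_in_game(avg, at_bats):
    """Exact value for comparison: 1 minus the chance of missing every at-bat."""
    return 1.0 - (1.0 - avg) ** at_bats

if __name__ == "__main__":
    # Hypothetical .300 hitter with 4 at-bats per game.
    print("estimate:", estimate_hit_in_game(0.300, 4, trials=20000, seed=42))
    print("exact:   ", round(exact_hit_in_game(0.300, 4), 4))
```

Because the generator is seeded, the run is fully reproducible, which is usually a feature for a simulation rather than a flaw.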