Using Graph Theory To Predict NCAA Tournament Outcomes
New submitter SocratesJedi writes "Like many technically-minded people, I don't have a lot of time to keep up with sports. Nevertheless, trying to predict the outcome of the NCAA men's basketball tournament is a fun activity to share with friends, family and colleagues. This year, I abandoned my usual strategy of quasi-randomly choosing teams and instead modeled the win-loss history of all Division I teams as a weighted network. The network included information from 5242 games played during the 2011-2012 season. From this, teams can be ranked using tools from graph theory and those rankings can be used to predict tournament outcomes. Without any a priori information, this method accurately identified all the #1 seeds in the top 5 best teams. It also predicts that at least one underdog, Belmont (#14 seed), will reach the Elite Eight. Although the ultimate test will be how well it predicts tournament outcomes, initial benchmarks suggest 70-80% accuracy would not be unreasonable."
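For readers wondering what "ranking teams on a weighted network" could look like in practice, here is a minimal sketch. The summary does not say which graph-theory tools were used, so this swaps in a PageRank-style iteration as a stand-in: every game adds an edge from the loser to the winner, and rating flows along those edges. The function name `rank_teams`, the damping value, the uniform edge weights and the four-team toy schedule are all assumptions made for illustration, not the submitter's method or data.

```python
from collections import defaultdict

def rank_teams(games, damping=0.85, iterations=100):
    """PageRank-style rating on a win-loss graph (an assumed stand-in method).

    games: iterable of (winner, loser, weight). Each game adds an edge
    loser -> winner, so rating flows toward teams that win.
    """
    teams = sorted({t for w, l, _ in games for t in (w, l)})
    edges = defaultdict(list)        # loser -> [(winner, weight), ...]
    out_weight = defaultdict(float)  # total weight of a team's losses
    for winner, loser, weight in games:
        edges[loser].append((winner, weight))
        out_weight[loser] += weight

    n = len(teams)
    score = {t: 1.0 / n for t in teams}
    for _ in range(iterations):
        new = {t: (1.0 - damping) / n for t in teams}
        for t in teams:
            if out_weight[t] == 0:
                # Undefeated team: no outgoing edges, so spread its rating evenly.
                share = damping * score[t] / n
                for u in teams:
                    new[u] += share
            else:
                for winner, weight in edges[t]:
                    new[winner] += damping * score[t] * weight / out_weight[t]
        score = new
    ranking = sorted(teams, key=lambda t: score[t], reverse=True)
    return ranking, score

# Hypothetical toy schedule: (winner, loser, weight); weight 1.0 = plain win.
toy_games = [
    ("A", "B", 1.0), ("A", "C", 1.0),
    ("B", "C", 1.0), ("B", "D", 1.0),
    ("C", "D", 1.0),
]
ranking, scores = rank_teams(toy_games)
print(ranking)
```

A weight other than 1.0 (margin of victory, say) would be one way to make the network "weighted", but that choice is a guess as well.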
Re: (Score:1)
See AAPL's PE vs growth rate for example.
You mean how low their PE is? By rights their stock should be up around $600/share.
Re: (Score:3)
Yes. If fairly valued at a PE of say 25 or so (which is still low for their growth rate), their stock should be at $875 or so.
MOT, INTC, EMC and JNPR are all similarly valued, but have much lower growth rates.
BIDU is the only large tech company with a similar growth rate. Its PE is 46, which would put AAPL's stock price at $1615.
VMware has lower growth, but a PE of 60. AAPL would be at $2100 if similarly valued.
http://www.google.com/finance#stockscreener [google.com]
past history (Score:5, Insightful)
Wouldn't running the algorithm against past years' records and testing against past tournament results be the best possible test to tune the algorithm?
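A rough sketch of what such a backtest could look like: score any predictor against games whose results are already known. The ratings, the game list and the names `backtest` and `pick_by_rating` are all hypothetical placeholders, not real tournament data.

```python
def backtest(predict, past_games):
    """Fraction of past games in which predict(a, b) named the actual winner."""
    correct = sum(1 for a, b, winner in past_games if predict(a, b) == winner)
    return correct / len(past_games)

# Hypothetical ratings produced by some ranking method from regular-season games.
ratings = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.2}

def pick_by_rating(a, b):
    """Predict that the higher-rated team wins."""
    return a if ratings[a] >= ratings[b] else b

# Hypothetical past tournament games: (team_a, team_b, actual_winner).
past_games = [
    ("A", "D", "A"),
    ("B", "C", "C"),  # an upset that the ratings miss
    ("A", "C", "A"),
    ("B", "D", "B"),
]
accuracy = backtest(pick_by_rating, past_games)
print(accuracy)
```

If the same past years are used both to tune the algorithm and to report its accuracy, the number will look better than it should, so it would make sense to hold some seasons back for the final check.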
Re:past history (Score:5, Insightful)
The problem stems from the fact that we traditionally predict a team will win if it is a stronger or better team, and we use our graph theory to produce relative team ratings. And if each game of the tournament were played over and over again with the winner of the majority going to the next round, then our methods would work even better. As it stands though, we are trying to predict a single sampling from a probability distribution - which will necessarily have error. Informally, the real tournament has upsets (when a weaker team beats a stronger one). Our algorithms can't predict these, the best they can do is gain a better understanding than humans as to which team is better.
Add to that the fact that the tournament is structured hierarchically - a mis-prediction in the first round prevents you from even attempting to predict later games (and by NCAA bracket scoring, that counts the same as mis-predicting those later games). So early upsets can potentially have large negative outcomes on brackets.
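To put a number on the single-game problem described above, here is a small sketch comparing one game against a best-of-N series. It assumes the stronger team wins each game independently with the same probability; the 0.6 figure is only an illustrative assumption.

```python
from math import comb

def series_win_probability(p, games):
    """Probability of winning a majority of an odd number of games,
    given an independent per-game win probability p."""
    needed = games // 2 + 1
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(needed, games + 1)
    )

single_game = series_win_probability(0.6, 1)
best_of_7 = series_win_probability(0.6, 7)
print(single_game, best_of_7)
```

Under those assumptions the better team advances from a best-of-7 about 71% of the time versus 60% from a single game, which is why repeated play would make ratings-based picks look better.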
Our algorithms can't predict these, (Score:2)
Yeah, like when someone intentionally throws a game. As long as people are gambling (somewhere) and money is to be made, there is an opportunity and incentive to cheat. Get your graph theory to account for that!
Or maybe regression analysis is better like Levitt used to find cheating with Sumo wrestling and US student test takers in his book Freakonomics. (Awesome book BTW) ;)
Re: (Score:2)
And the cited estimate of 70-80% accuracy seems made up. People who research the field know that there is far less certainty than that. At something like 20% confidence, your prediction should be something like 20%-90%.
If a coin flip is 50% accurate, then an extra 20% accuracy will give you 70%.
Re: (Score:2)
There are only two teams per game, so modeling that with a coin flip makes a lot more sense than modeling it with a die. A random chance will give you 50% accuracy at picking the winner. You have to do better than 50% accuracy to have any claim at success at all. The real question is, what was the GP talking about when he claimed that success rates between 20% and 90% were more realistic? Why even try if your algorithms can't beat random chance?
Re: (Score:1)
Here in the UK betting is perfectly legal and Betfair (a betting exchange that allows people to take either side of a bet) has a nice API that lets you back or lay most sporting events. People use very sophisticated algorithms to work out the in play odds of football matches, adjusting them second by second as the game goes along.
As a h
Re: (Score:2)
50% is not as low as you go, because of the way brackets are scored. You predict the outcome of *all* the games in the tournament before *any* games are played. Which means that errors in the first round mean that you haven't even properly predicted who is playing in the second round. If the team you picked as winning a game doesn't even play that game, then you automatically lose.
If we simplify the tournament, we can preten
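The comment above is cut off, so the following is not the poster's own simplification, just one way to illustrate the point. Assume a 64-team bracket, every game a true 50/50, and a bracket filled in by coin flips before any game is played. A pick for a round-r game is only right if that team wins r games in a row, so it is correct with probability 0.5**r.

```python
def expected_coin_flip_accuracy(rounds=6):
    """Expected fraction of correct picks in a bracket filled in by coin flips,
    assuming every real game is also a 50/50 (an illustrative assumption)."""
    teams = 2 ** rounds
    expected_correct = 0.0
    for r in range(1, rounds + 1):
        games_in_round = teams // 2 ** r
        p_correct = 0.5 ** r  # picked team must win r straight games
        expected_correct += games_in_round * p_correct
    return expected_correct / (teams - 1)

coin_flip_accuracy = expected_coin_flip_accuracy()
print(coin_flip_accuracy)
```

Under those assumptions the expected accuracy is about 34% of the 63 games, not 50%, because every early miss also wipes out the later picks that depended on it.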
Re: (Score:2)
The problem stems from the fact that we traditionally predict a team will win if it is a stronger or better team, and we use our graph theory to produce relative team ratings. And if each game of the tournament were played over and over again with the winner of the majority going to the next round, then our methods would work even better. As it stands though, we are trying to predict a single sampling from a probability distribution - which will necessarily have error. Informally, the real tournament has upsets (when a weaker team beats a stronger one). Our algorithms can't predict these, the best they can do is gain a better understanding than humans as to which team is better.
It's not just the single game problem - and even if you set aside upsets, the "stronger" team doesn't always win because, as the coaches have been saying for years, it's about matchups. Teams have strengths & weaknesses - style of play, offensive & defensive skill sets of individual players, etc. A team with a tremendous front court but weak ball handlers is more likely to lose to an inferior team that has a high pressure trapping defense whereas it might beat a stronger team that doesn't use the same
Predicting the top is easy (Score:5, Insightful)
Everyone knows who the big names are who are likely to make it to the final four. It's predicting how things will go at the middle and bottom, where teams are much more likely to be evenly matched, that's really hard.
Re: (Score:3)
You mean like penciling in Butler for the championship two years in a row? Or the final four matchup of Butler vs VCU?
70-80%? (Score:2, Informative)
Okay, you can get 50% accuracy just by flipping a coin.
If you go with "the higher seed wins", you get to 85% or so. Color me unimpressed.
Re: (Score:3)
Should be lower seed (I am the AC).
Re:70-80%? (Score:5, Informative)
And my numbers are off. In 2011, 43 times out of 63, the lower seed won for about a 68% win rate.
Just take last year's results (Score:2)
You can get very reasonable results by just taking last year's results. This works for most sports.
Re:Just take last year's results (Score:5, Insightful)
That may work for pro sports, but not for college sports. In fact, because teams usually lose their nucleus after winning it all (players declare for the draft), it is rare for a team to make it to the final game two or more years in a row.
Re: (Score:1)
I believe the point was that he could take last year's season and build his dataset around the regular season games, create a bracket, and then match his bracket against the winning results.
How is this news? (Score:2)
Re: (Score:3)
It's not. This is just a puff piece trying to drive hits to their site by mentioning the NCAA tournament.
As a sports fan (Score:4, Interesting)
Some problems I see. Disclaimer: I know there's a margin of error here as the author said, and I know my observations will be based largely on anecdotal evidence, making it inferior. But if sports were so easy to predict there would be no sports gambling.
- That's probably too far for Belmont; a #14 has only ever gotten as far as the Sweet 16, twice (Cleveland State '86, Chattanooga '97). The lowest seed to make an Elite 8 is Missouri in 2002 as a #12. Belmont is actually going to be one of the more popular upset picks, but they would have to upset two far superior teams in 3 days.
- It's a bit too "chalk". #1 seeds generally survive the first two games (undefeated against #16's, 55-14 v. #8's, 59-6 v. #9's), but the #2's have it worse (only four losses v. #15's, but 58-21 v. #7's and 29-21 v. #10's). I know two #12's, a #13 and a #14 doesn't seem like "chalk" but historically it's much more likely that we'll see more #5-7 or #10-11's. To have only one #2 not make the Elite 8 and all the #1's would be almost unheard of.
- A #12 always beats a #5, but three of them doing so in one year would seem unlikely, as they're only 39-89 overall.
- Some of the other first round matchups seem a bit improbable. It has every #6 and every #7 winning, for example.
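As a rough check on the #12 vs. #5 point in the list above, the 39-89 record can be turned into a per-game upset rate and plugged into a binomial calculation. This assumes the four #12 vs. #5 games in a year are independent and all share that historical rate, which is a simplification.

```python
from math import comb

# Historical #12 vs. #5 record quoted in the list above: 39 wins, 89 losses.
wins, losses = 39, 89
p_upset = wins / (wins + losses)

def prob_at_least(k, n, p):
    """Binomial probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_one_or_more = prob_at_least(1, 4, p_upset)    # at least one #12 wins
p_three_or_more = prob_at_least(3, 4, p_upset)  # three or four #12s win
print(p_upset, p_one_or_more, p_three_or_more)
```

Under those assumptions at least one #12 wins in about 77% of years, while three or more win in about 9%, so a bracket with three #12 upsets is possible but a long shot.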
Re: (Score:3)
I didn't read the article (yet), but I put together a game result predictor a couple of years ago that I ran against the tournament field with about an 83% success rate for the whole tournament. It was in the 93% range for the first two rounds. My algorithm utilized season long team statistics to get a team's baseline and then incorporated strength of schedule and seeding components. Just like you mentioned about how far a team has historically progressed from a specific seed, I used historical analysis of
Re: (Score:2)
If your model is good enough, you should be able to make money on sports betting on it.
Not against people whose model is just as good, and not (over the long term) against any professional gambling enterprise (legal casino or bookie) set up to profit whether you win or lose. A professional gambling outfit either takes a cut that negates all statistical advantages of having a good predictive method or they set up the odds so that they'll make back what they lost to you last time when you lose to them next time. The only people who make money reliably in gambling are those who have found a suck
"Like many technically-minded people, I don't... (Score:1)
... have a lot of time to keep up with sports."
Yes, if you enjoy sports, you must not be technically-minded. Tis for the plebes...
But, I bet you have time for Skyrim!
Re: (Score:3)
At least in Skyrim, you're an interactive participant. That, and Skyrim isn't just a polite way for people to act out their base tribalistic instincts.
Moral of the story... (Score:2)
Re: (Score:3)
Ah, you would think that the casino sports book odds were the most accurate available and only determined by scientific study of the sports.
BZZZT! Wrong. Casinos need to make a profit. So they determine the *initial* odds by studying the sport, but then change the odds in reaction to the bets that are placed. They try to have equal amounts on both sides of a bet. They pay less to the winners than they get from the losers.
What's the point of pointing that out? Well, you have some pro gamblers who actually do m
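To make "they pay less to the winners than they get from the losers" concrete, here is a small sketch of the break-even win rate a bettor needs. The 110-to-win-100 pricing is an assumption chosen for illustration (a common convention for even-matchup bets in US sports books); the comment itself gives no numbers.

```python
def break_even_win_rate(stake, payout):
    """Win rate at which a bettor who risks `stake` to win `payout` breaks even."""
    return stake / (stake + payout)

# Assumed pricing: risk 110 to win 100 on either side of the bet.
needed = break_even_win_rate(110, 100)
print(needed)
```

At that price a model has to pick winners more than about 52.4% of the time just to stop losing money, so beating a coin flip is not enough.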
The joys of single elimination (Score:3)
March Madness is notoriously hard to predict, partly because of the number of teams involved and also because of the single elimination system that I love so much. It's prevalent in few sports and makes each game mean a lot more, also opening the door for Cinderella to take her 15 minutes of fame. 7-game playoff rounds like they have in baseball and the NBA tend to nullify those outliers. I honestly think that's a big reason for the success of the NFL too - every game and every play means a hell of a lot more when the best possible record is 19-0.
Doesn't matter if it works (Score:2, Insightful)
Can you write a Windows installer for it and sell it to gamblers?
Nerd (Score:2)
Like many technically-minded people, I don't have a lot of time to keep up with sports.
The word you're looking for is "Nerd". It's OK to say it, it's in the title-bar of Slashdot.
Re: (Score:3)
You know, for stereotypical nerd behaviour like communicating to each other in incomprehensible jargon and obscure references that other people don't get, obsessive behaviour, dressing up in ridiculous costumes for gatherings, etc, I've come to realize that nothing beats a hard-core sports fan.
I'll go with the squid (Score:2)
Behind the Curve (Score:1)
*YAWN* (Score:2)
That's like saying, "I did a lap in a Formula 1 car, and I'm either 15 seconds ahead of last year's world champion, or I'm a minute behind the field."
You haven't done this before, have you?
A plug for Nate Silver's FiveThirtyEight (Score:1)
His statistical reasoning is always well described, so that if you disagree with his results, at least you understand why you disagree. He's got "picks" [nytimes.com] and a description of the system [nytimes.com] used to generate them.
The original article is an interesting network analysis exercise, but it is really limited by its assumption of no a priori quality data. (Any time you beat Kentucky or North Carolina or other perennial powerhouses, that's almost always a quality win.) Sagarin and LRMC follow similar logic, but without a
Not enough time? (Score:4, Insightful)
You don't have time to follow sports, but you have time to model "information from 5242 games played during the 2011-2012 season".
You could be honest and just say you don't really care, but get involved in the playoffs because everyone else is talking about it.
I'm guessing your level 80 warlock probably doesn't 'have time' either. :)
Re: (Score:2)
You don't have time to follow sports, but you have time to model "information from 5242 games played during the 2011-2012 season".
How is that a contradictory statement? He's so busy doing data modeling stuff that he doesn't have time to watch sports.
When someone says they "don't have time" to do something, it's generally because they're very busy with....gasp....other things!
My Best Luck (Score:2)
It seems in office pools I do the best by picking favorite team colors.
Re: (Score:2)
I "won" an office pool once without even playing. I told the guy that I could win the whole thing, but, as I didn't want to take their money through gambling, I would just tell him my picks after it was closed.
The problem? He was giving points to each "winner" based on their number. If a #1 won, you got one point. If a #12 won, you got 12 points. I just picked 9-16 the whole tournament through. (He admitted that they all would have hated me had I played.) After the first round, with 4 upsets, there w
Re: (Score:2)
Regarding the bracket, the four No 1 seeds march along, undefeated, until they meet in the final four. While this can happen, it seems like a trivial and unsophisticated result to me.
The problem with these predictions is, of course, that is the most likely scenario. There are enough other things that can happen that it is probably a worse than 50-50 shot, but there isn't another scenario that is more likely. Really, all any algorithm can do to beat picking the better seed every time is try to find spots where teams are seeded either higher or lower than they should be, and the very top and bottom of the list are probably not the most likely spots for this to happen.
The problem is (Score:2)
There's too much data and too many variables. Even just inputting all the known, public data might significantly improve the accuracy, but there's also lots of unknown private data that can influence games. Algorithms like this can't account for things like the coach's son getting killed in an automobile accident the night before a game, or the star center getting hit with a bad flu. And when you make it complex enough to take in all that data, it still has to get all that data somewhere, which means it has
Yes, I see... (Score:2)
... modeled the win-loss history of all Division I teams as a weighted network. The network included information from 5242 games played during the 2011-2012 season. From this, teams can be ranked using tools from graph theory ...
... you obviously don't have enough time to keep up with sports.