Predicting 2012 NCAA Tournament Upsets | Stat Geek Idol

***NOTE*** This is a winning Round of 64 entry in our inaugural Stat Geek Idol contest. This article was conceived of and written by John Ezekowitz of the Harvard Sports Analysis Collective. The opinions or predictions expressed below do not represent the views of, and are solely those of the author. We unfortunately did not get a chance to review this entry until after the start of the tournament; otherwise we would have published it prior to the first round.

For the past two years, I have attempted to systematically predict First Round NCAA Tournament upsets using a dataset of match-ups from 2004 onward. Last year, I improved the model by adding opponent Four Factors data, and the model correctly identified Marquette (11 seed), VCU (11 seed), and Richmond (12 seed) as the three teams most likely to pull off upsets.

So what does this model say about this year’s March Madness bracket? Read on to find out.

What is the Model?

While last year's performance was no doubt lucky (the odds of all three of those teams pulling upsets according to my model were roughly 1 in 5), I think the model does a good job of identifying factors that are important in the one-game setting of the NCAA tournament. The backbone of the model is turnovers and rebounding. Specifically, I use turnover rates and rebounding rates, rather than raw totals, to eliminate the potential bias of pace.

Turnovers seem to be very important in March: if you take care of the ball and take it away from your opponent, you are in effect creating extra opportunities to score. Rebounding, too, is important because better defensive rebounding minimizes the opponent’s opportunities to score and better offensive rebounding maximizes a team’s scoring chances.
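To make the pace adjustment concrete, here is a minimal sketch of how rate statistics can be computed from box-score totals. The possession estimate (with a free-throw weight of 0.475) is a common approximation and is an assumption here, not necessarily the exact formula behind the model's inputs; the function names and the two example teams are hypothetical.

```python
def possessions(fga, oreb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box-score totals (a common approximation)."""
    return fga - oreb + tov + ft_weight * fta


def turnover_rate(tov, poss):
    """Turnovers per 100 possessions."""
    return 100.0 * tov / poss


def offensive_rebound_rate(oreb, opp_dreb):
    """Percent of available offensive rebounds that a team collects."""
    return 100.0 * oreb / (oreb + opp_dreb)


# Two hypothetical teams that each commit 12 turnovers, at different paces.
slow = possessions(fga=50, oreb=10, tov=12, fta=20)  # 61.5 possessions
fast = possessions(fga=62, oreb=12, tov=12, fta=24)  # 73.4 possessions

print(round(turnover_rate(12, slow), 1))  # 19.5
print(round(turnover_rate(12, fast), 1))  # 16.3
```

The raw turnover counts are identical, but the slower team gives the ball away on a larger share of its possessions, which is the distinction the rate statistics are meant to capture.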

In addition to these factors, the model also includes measures of Strength of Schedule. This in effect proxies for seed (the two are highly correlated), but also includes valuable information about when teams are mis-seeded (see Wisconsin in 2010, for instance).

Finally, the key to the model is the match-ups. A team with a solid upset profile might be derailed by facing a juggernaut who also takes care of the ball and forces a lot of turnovers.

The Probit Model

This model had 256 observations and a Pseudo R^2 of 0.38. The result variable is coded 1 for a win and 0 for a loss, so a negative coefficient means the variable is associated with lower odds of winning. As you can see, a higher turnover percentage predicts lower odds of winning a first-round game. The SOS coefficients look larger because they are scaled from 0 to 1, not 0 to 100 as the other variables are.
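For readers unfamiliar with probit models, the sketch below shows how a set of coefficients turns into a win probability: the linear combination of the inputs is passed through the standard normal CDF. The coefficients, intercept, and feature names are invented for illustration and are not the fitted values from this model.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_win_probability(intercept, coefs, features):
    """P(win) = Phi(intercept + sum of coefficient * feature)."""
    z = intercept + sum(coefs[name] * features[name] for name in coefs)
    return normal_cdf(z)


# Hypothetical coefficients: rates are on a 0-100 scale, SOS on a 0-1 scale.
coefs = {"turnover_pct": -0.10, "off_reb_pct": 0.05, "sos": 2.0}
team = {"turnover_pct": 18.0, "off_reb_pct": 34.0, "sos": 0.55}
sloppy_team = dict(team, turnover_pct=22.0)  # same team, more turnovers

p_team = probit_win_probability(-1.0, coefs, team)
p_sloppy = probit_win_probability(-1.0, coefs, sloppy_team)
print(p_team > p_sloppy)  # True: the negative coefficient lowers P(win)

# Rescaling SOS from 0-1 to 0-100 shrinks its coefficient by a factor of 100
# without changing the prediction, which is why the SOS terms look large.
coefs_100 = dict(coefs, sos=0.02)
team_100 = dict(team, sos=55.0)
p_rescaled = probit_win_probability(-1.0, coefs_100, team_100)
```

In this toy setup the first team's linear predictor works out to roughly zero, so its win probability is about 50 percent.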

In out-of-sample testing, which was conducted by removing a year from the dataset, running the model, and using that model to estimate probabilities for the games in that year, the model proved to be fairly conservative. It identified 21 “underdog” teams seeded 11-14 that had greater than 50 percent odds of winning in the last nine years. Fifteen of those teams (about 71 percent) went on to win, and six did not. Thus the model is good at avoiding false positives.
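The hold-one-year-out procedure described above can be sketched as follows. This is an outline under stated assumptions: `fit` and `predict` are hypothetical placeholders for whatever estimation routine is used (a probit in the article), and the stand-in model here simply predicts the training-set win rate so the example runs on its own.

```python
from collections import defaultdict


def leave_one_year_out(games, fit, predict):
    """For each year, fit on all other years and predict the held-out games.

    `games` is a list of dicts with at least the keys 'year' and 'won'.
    Returns a list of (game, predicted_probability) pairs.
    """
    by_year = defaultdict(list)
    for game in games:
        by_year[game["year"]].append(game)

    results = []
    for held_out in sorted(by_year):
        train = [g for g in games if g["year"] != held_out]
        model = fit(train)  # the model never sees the held-out year
        for game in by_year[held_out]:
            results.append((game, predict(model, game)))
    return results


# Stand-in model (an assumption for illustration): predict the training win rate.
def fit_base_rate(train):
    return sum(g["won"] for g in train) / len(train)


def predict_base_rate(model, game):
    return model


toy_games = [
    {"year": 2010, "won": 1},
    {"year": 2010, "won": 0},
    {"year": 2011, "won": 1},
    {"year": 2011, "won": 1},
]
predictions = leave_one_year_out(toy_games, fit_base_rate, predict_base_rate)
```

Because each year's games are scored by a model that was fit without them, the resulting probabilities give a fairer picture of how the model would have performed in real time.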

What it is not as good at, however, is avoiding false negatives. Over that same timespan, 15 other teams pulled off upsets where the model predicted the better seed to win. Forty minutes under the bright lights is a small sample, and no predictive system will be perfect. I am happier with a model that is conservative and predicts fewer upsets more accurately than a model that predicts too many upsets out of sample.
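The trade-off between false positives and false negatives can be summarized with precision and recall. The short calculation below uses the counts reported above and assumes the 15 missed upsets come from the same pool of games as the 21 predicted ones; the variable names are my own.

```python
true_positives = 15   # predicted upsets that happened
false_positives = 6   # predicted upsets that did not happen
false_negatives = 15  # upsets the model did not predict

# Precision: of the upsets the model called, how many occurred?
precision = true_positives / (true_positives + false_positives)

# Recall: of the upsets that occurred, how many did the model call?
recall = true_positives / (true_positives + false_negatives)

print(f"precision = {precision:.3f}")  # precision = 0.714
print(f"recall    = {recall:.3f}")     # recall    = 0.500
```

Under that assumption, the model is right about 71 percent of the time when it calls an upset, but it catches only half of the upsets that actually occur.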

2012 Upset Predictions

If you look at the Vegas odds for this year’s first round, you’ll notice something striking: the odds for the 13 and 14 seeds are a lot shorter this year than most years. The 3 and 4 seeds are not as heavily favored: Georgetown, Michigan, Florida State, and Indiana are all favored by six points or fewer, whereas last year 3 and 4 seeds were favored by an average of 11 points.

The 2012 crop of 13 and 14 seeds is especially tough. That is borne out in the model predictions above.

According to this model, every 14 seed has at least a 30 percent chance of pulling an upset. That is ridiculously high compared to previous years.

As you can see, the model likes VCU, NC State, Ohio, and Cal (provided they beat USF in the First Four game) to pull upsets. New Mexico-Long Beach State is essentially a coin flip in this model, and Florida State looks vulnerable because of their terrible turnover rate and bizarre inability to rebound defensively.

VCU will be a very trendy upset pick this year because of their run last season, but this model adores the Rams because of their ability to force turnovers (best in the nation) and take care of the ball (27th best in the nation). A note of caution, however: VCU is a big outlier. Since this model is looking for average effects, it may not do as well out of sample predicting teams far away from the average, like VCU. The Rams fit the profile of a tournament David, but between their popularity and their extreme profile, understand that the expected value of picking them might be lower than this model implies.