***NOTE*** This is a winning Round of 64 entry in our inaugural Stat Geek Idol contest. This article was conceived of and written by Tyler Williams of Causal Sports Fan. The opinions or predictions expressed below do not represent the views of TeamRankings.com, and are solely those of the author.
A few years ago, I ran my office NCAA pool. Right at the deadline, a Swiss economist whom I worked for came over, bracket and sheepish grin in tow. He knew nothing about basketball, but someone had explained the seeding system to him. Being an economist, he optimized based on the only inputs he had: he filled out the bracket purely by seed (choosing randomly among the one seeds in the Final Four). He finished second, of course, which was almost as bad as the year my wife won my pool by choosing teams from her favorite places.
Maybe this Swiss fellow saw through the charade. How predictive is seeding after all? Since 2007, the higher seed has won about 72% of all tournament games (picking the winner randomly when seeds are the same). The favorite by the gambling spread has won 73% of the time, so seeding does quite well.
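For readers who want to run this kind of check themselves, here is a minimal sketch of scoring a "pick the better seed" rule, with a coin flip when the seeds are equal. The function name and the `(winner_seed, loser_seed)` input format are assumptions made for illustration, and the sample games are made up rather than real results.

```python
import random

def seed_pick_accuracy(games, rng=None):
    """Fraction of games the better (lower-numbered) seed won.

    games: list of (winner_seed, loser_seed) tuples (hypothetical format).
    Equal seeds are decided by a coin flip, so results with ties vary
    unless a seeded random.Random is passed in.
    """
    if not games:
        raise ValueError("need at least one game")
    rng = rng or random.Random(0)
    correct = 0
    for winner_seed, loser_seed in games:
        if winner_seed < loser_seed:
            correct += 1          # the better seed won, so the pick was right
        elif winner_seed == loser_seed and rng.random() < 0.5:
            correct += 1          # same seed: a random pick is right half the time
    return correct / len(games)

# Made-up example: three favorites win, one 12 seed beats a 5 seed.
sample_games = [(1, 16), (12, 5), (2, 15), (3, 14)]
print(seed_pick_accuracy(sample_games))
```

Running the same function over every tournament game since 2007 is the sort of calculation behind the 72% figure, though the exact data handling used for that number is not shown here.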
This year, I set out to match this performance using team statistics from the regular season (all stats are for the 2007 through 2011 seasons). I’ll have some upset picks at the end, but you have to do the legwork with me first.
Working Towards A Model
The question to ask is: what really matters in a basketball game? The team that scores more points wins. That seems like a fair (and obvious) starting point. In fact, the team with the higher average point differential during the regular season wins about 68% of tournament games (using Pythagorean expectations doesn’t improve this number). Not bad, but not as good as seeding.
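As a rough illustration of the two rating rules mentioned above, the sketch below picks a winner either by average point differential or by Pythagorean expectation. The exponent of 10 is an assumption (values in roughly that range are commonly used for basketball), and the team dictionaries and their field names are hypothetical.

```python
def pythagorean_expectation(points_for, points_against, exponent=10.0):
    """Expected win fraction from points scored and allowed.

    The exponent is an assumed value, not one taken from the article.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

def point_diff(team):
    """Average regular-season scoring margin."""
    return team["ppg"] - team["opp_ppg"]

def pyth(team):
    """Pythagorean rating built from the same two averages."""
    return pythagorean_expectation(team["ppg"], team["opp_ppg"])

def pick_winner(team_a, team_b, rating):
    """Name of the team with the higher rating; ties go to team_a."""
    return team_a["name"] if rating(team_a) >= rating(team_b) else team_b["name"]

# Hypothetical teams, for illustration only.
team_a = {"name": "A", "ppg": 75.0, "opp_ppg": 65.0}
team_b = {"name": "B", "ppg": 70.0, "opp_ppg": 68.0}
print(pick_winner(team_a, team_b, point_diff), pick_winner(team_a, team_b, pyth))
```

Both rules are built from the same two averages, so they usually rank a pair of teams the same way, which fits the observation that the Pythagorean version doesn't improve the hit rate.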
Some teams play harder schedules than others, though. A simple proxy for the strength of your schedule is your conference. I defined three levels of conferences: the biggies (Pac 12, Big Ten, Big 12, Big East, ACC, SEC), the middle (Atlantic 10, CAA, CUSA, Horizon League, Mountain West, WCC), and the dregs. Conference association alone predicts 63% of games correctly, but the real gains come from the combination. Using simple linear regression to predict point differential as a function of the strength of your conference and your average point differential gets the winner right 71% of the time.
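One way to set up a regression like this is sketched below in plain Python. It is a sketch under stated assumptions, not the exact model used here: the target is taken to be a team's point differential in a tournament game, the conference tiers enter as two indicator variables (with the "dregs" as the baseline), and the function names and training rows are hypothetical.

```python
MAJOR = {"Pac 12", "Big Ten", "Big 12", "Big East", "ACC", "SEC"}
MIDDLE = {"Atlantic 10", "CAA", "CUSA", "Horizon League", "Mountain West", "WCC"}

def features(conference, avg_point_diff):
    """Encode conference tier as two 0/1 indicators plus the scoring margin."""
    return [float(conference in MAJOR), float(conference in MIDDLE), avg_point_diff]

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, row):
    """Predicted point differential for one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Made-up training rows: (conference, regular-season margin) -> tournament margin.
rows = [features("ACC", 10.0), features("SEC", 5.0), features("WCC", 8.0),
        features("CAA", 2.0), features("Ivy", 12.0), features("Ivy", 3.0)]
targets = [6.0, 2.0, 1.0, -4.0, -1.0, -8.0]
coefs = fit_ols(rows, targets)
# Pick whichever team has the higher predicted differential.
print(predict(coefs, features("ACC", 7.0)) > predict(coefs, features("Ivy", 9.0)))
```

In practice a statistics package would do the fitting; the hand-rolled solver just keeps the example self-contained.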
That’s getting pretty close to the seeding predictions, without using much information about the teams at all. It’s also a little boring, though. There’s nothing specific about matchups in these predictions, and matchups are what make games so compelling in the first place.
Using Efficiency Stats To Simulate The Games
My last approach works on the matchup angle using regular season efficiency (i.e., per possession or per play) stats for each team to predict statistics in the tournament.
Per-game stats (points, rebounds, turnovers, etc.) aren’t great because teams play at different speeds. This means that some teams get more possessions to rack up stats than other teams. When two teams play, they have the same number of possessions by definition. What really matters is how many points a team scores per possession, which is driven by just a few variables: shot selection, shooting percentage, offensive rebound percentage, and turnovers per possession.
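To make the per-possession idea concrete, here is a small sketch that converts box score totals into points per possession. Possessions are not recorded directly, so the code uses a common box score estimate; that formula and its free throw weight of 0.475 are assumptions, not something taken from the data described here, and the numbers in the example are made up.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box score totals.

    A possession ends with a shot that is not rebounded by the offense,
    a turnover, or a trip to the line. ft_weight is an assumed fraction
    of free throw attempts that end a possession.
    """
    return fga - orb + tov + ft_weight * fta

def points_per_possession(points, fga, orb, tov, fta):
    """Offensive efficiency: points scored per estimated possession."""
    return points / estimate_possessions(fga, orb, tov, fta)

# Made-up box score: 72 points on 60 shots, 10 offensive boards,
# 12 turnovers, and 20 free throw attempts.
print(round(points_per_possession(72, 60, 10, 12, 20), 3))
```

Two teams with the same points per game can look very different by this measure if one of them plays at a much faster pace.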
I use the values of these variables for each team in simple regressions (along with conference strength) to predict their values in each game. Then, I simulate each game 50 times, using the efficiency variables to determine the likelihood of each outcome (e.g., made three-pointer, turnover, offensive rebound after a miss) on each play.
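A play-by-play simulation along these lines might look like the sketch below. It is a simplified illustration, not the actual simulator: free throws are ignored, each team's rates are fixed inputs rather than regression outputs adjusted for the opponent, and the class name, field names, default of 65 possessions, and example rates are all assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Profile:
    turnover_rate: float  # chance a play ends in a turnover
    three_share: float    # share of shots taken from three
    two_pct: float        # two-point shooting percentage
    three_pct: float      # three-point shooting percentage
    orb_rate: float       # chance of an offensive rebound after a miss

def simulate_possession(p, rng):
    """Play out one possession and return the points scored (0, 2, or 3)."""
    while True:
        if rng.random() < p.turnover_rate:
            return 0
        is_three = rng.random() < p.three_share
        make_prob = p.three_pct if is_three else p.two_pct
        if rng.random() < make_prob:
            return 3 if is_three else 2
        if rng.random() >= p.orb_rate:
            return 0  # defensive rebound ends the possession
        # Offensive rebound: the same possession continues with a new play.

def simulate_game(a, b, possessions, rng):
    """Point differential (a minus b) with equal possessions for both teams."""
    score_a = sum(simulate_possession(a, rng) for _ in range(possessions))
    score_b = sum(simulate_possession(b, rng) for _ in range(possessions))
    return score_a - score_b

def average_margin(a, b, possessions=65, n_sims=50, seed=0):
    """Average point differential across n_sims simulated games."""
    rng = random.Random(seed)
    margins = [simulate_game(a, b, possessions, rng) for _ in range(n_sims)]
    return sum(margins) / n_sims

# Made-up profiles, for illustration only.
team_a = Profile(0.18, 0.35, 0.50, 0.36, 0.33)
team_b = Profile(0.21, 0.30, 0.47, 0.33, 0.30)
print(average_margin(team_a, team_b))
```

The predicted winner is whichever team has the positive average margin. Fixing the random seed makes a run repeatable, which is handy when comparing different sets of inputs.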
Using the average point differential across the 50 simulations to predict the winner gets 73% of past tournament games correct! A small improvement, but I’ve matched the performance of seeding.
What The Simulation Says About 2012
Using the results from past regressions, I can estimate the expected point differential for each of the games this year. I input the efficiency stats for this year’s teams, and the coefficients from my regressions output the expected stats for the tournament. I simulated all the first round matchups using these stats, and your official upset picks are: St. Bonaventure over Florida St., St. Louis over Memphis, and West Virginia over Gonzaga!
[Editor's Note: This post was written before the first round games took place, but we were unable to review and publish it until afterwards.]