Predicting The NCAA Tournament Using Machine Learning Methods | Stat Geek Idol

***NOTE*** This is a Round of 64 entry in our inaugural Stat Geek Idol contest. This article was conceived of and written by Kenneth Deakins. The opinions or predictions expressed below do not represent the views of TeamRankings.com, and are solely those of the author. 

What follows are some of the results of a project I did with three other people for a Computer Science class. I'll start off with a thank you to the folks at TeamRankings, who were extremely helpful in providing us with data for our project.

For our project we attempted to predict March Madness tournament results using regular season statistics. To do this we employed a variety of Machine Learning (a branch of computer science) techniques. The two techniques that ended up working best were Support Vector Machines (SVM) and K-Nearest Neighbors (KNN). I'll try to give a short explanation of how each technique works.

How We Created The Models

Before I explain how the algorithms work, I should describe how we converted our data into a format we could use. Basically, we encoded each game as a vector containing team A's regular season statistics, team B's regular season statistics, and the difference between team A's and team B's statistics. We then labeled the vector 1 if team A won, and -1 if team B won. For each of the years we used (1997-2011), we ended up with 63-67 of these vectors. Once the games were in this format, we could use KNN and SVM to predict the results of a tournament.
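As a rough illustration, here is what building one of these game vectors might look like in Python. The three stats shown are just placeholders, not the full feature set we used:

```python
import numpy as np

# Placeholder regular season stats for two teams -- e.g. points,
# rebounds, and turnovers per game. The real feature set was larger.
team_a = np.array([78.2, 36.1, 11.5])  # team A's season averages
team_b = np.array([71.4, 33.8, 13.2])  # team B's season averages

# A game vector: team A's stats, team B's stats, and their difference.
game_vector = np.concatenate([team_a, team_b, team_a - team_b])

# Label: 1 if team A won this game, -1 if team B won.
label = 1
```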

KNN works in the following way: you take an input game that hasn't been labeled yet and compare it to a set of games that have already been labeled. The algorithm takes the K (some integer) most similar vectors (based on distance) and averages their labels. If K = 3, and the closest 3 games are labeled 1, 1, -1, then the algorithm labels the input vector 1 (team A wins). One of the things we had to do was tune the value of K for the most accurate results; our optimal value ended up being 171. To predict a year of games with KNN, we compared each game in that year to all games from every other year we had data for. Our individual game prediction accuracy for KNN was 69.68%. As a baseline comparison, picking the favorite for the same set of games results in an accuracy of 71.39%.
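Here is a sketch of that prediction step using scikit-learn's KNeighborsClassifier (this isn't our actual code, and the training data below is a random placeholder standing in for the real game vectors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X_train: game vectors from every year except the one being predicted;
# y_train: their 1/-1 labels. Random placeholders here.
X_train = np.random.randn(1000, 9)
y_train = np.random.choice([1, -1], size=1000)

# K = 171 was our tuned value; distance is plain Euclidean.
knn = KNeighborsClassifier(n_neighbors=171, metric='euclidean')
knn.fit(X_train, y_train)

# Predict a new, unlabeled game: the majority label among the 171
# nearest labeled games decides whether team A (1) or team B (-1) wins.
new_game = np.random.randn(1, 9)
print(knn.predict(new_game))  # prints [1] or [-1]
```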

While the process of KNN is relatively intuitive and easy to implement, SVM is quite complicated to implement. The important thing to know is that it trains on data from years other than the one it is predicting, and produces a weight vector in which each statistic is assigned a weight. This weight vector is then multiplied by a game's statistic vector to get a positive or negative value, which determines the predicted winner. SVM had an overall individual game prediction accuracy of 70.09%, which was also not quite as good as picking all favorites.
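For the curious, here is roughly what that looks like with an off-the-shelf linear SVM (again a sketch with placeholder data, not our actual implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder training data: game vectors and 1/-1 labels from the
# other years, built with the encoding shown above.
X_train = np.random.randn(1000, 9)
y_train = np.random.choice([1, -1], size=1000)

# A linear SVM learns one weight per statistic during training.
svm = LinearSVC()
svm.fit(X_train, y_train)
print(svm.coef_)  # the learned weight vector

# Prediction is the sign of the weighted sum: positive means team A
# is predicted to win, negative means team B.
new_game = np.random.randn(1, 9)
print(np.sign(svm.decision_function(new_game)))
```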

The Exciting Part: Predictions

For this year, both KNN and SVM predicted rather chalky brackets. They picked a combined 4 upsets in the first round, and both have an Elite 8 composed entirely of 1 and 2 seeds. Here are some details.

Upsets

While both algorithms do tend to pick non-upsets, they have their share of aggressive picks (for example, in 2007 SVM predicted that Villanova, a 9 seed, would make it to the final game). And each algorithm highlights some potential upsets this year. KNN predicts that Belmont (14) will beat Georgetown in round 1, and then win against San Diego State. SVM predicts that Texas (11), Purdue (10), and NC State (11) will win in round 1.

Close Games

In addition to predicting which team will win each game, both KNN and SVM have the useful property that they assign each game a value that can be read as a level of confidence. KNN's values range from -1 to 1, while SVM's are roughly normally distributed around 0. The first round games with the smallest confidences for SVM are: Connecticut vs. Iowa State (.226), Gonzaga vs. West Virginia (.088), Creighton vs. Alabama (.224), and Georgetown vs. Belmont (.085). For comparison, Kentucky vs. Western Kentucky has a confidence of 4.87. For KNN, the closest games in round 1 are: Iowa State vs. Connecticut (-.027), Marquette vs. BYU (.060), Temple vs. California (.0042), San Diego State vs. N.C. State (.014), and Georgetown vs. Belmont (-.036). For comparison, Kentucky vs. Western Kentucky has a value of .79.
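If you wanted to reproduce this kind of confidence score, here is a sketch of how both values can be read off scikit-learn models like the ones above (placeholder data again, not our actual code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Placeholder training data, as in the earlier sketches.
X_train = np.random.randn(1000, 9)
y_train = np.random.choice([1, -1], size=1000)
new_game = np.random.randn(1, 9)

svm = LinearSVC().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=171).fit(X_train, y_train)

# SVM confidence: the signed distance from the separating hyperplane,
# roughly centered on 0; larger magnitude means a more lopsided pick.
print(svm.decision_function(new_game))

# KNN confidence: the average label of the 171 nearest games, which
# falls in [-1, 1]; values near 0 mean the neighbors were split evenly.
idx = knn.kneighbors(new_game, return_distance=False)[0]
print(y_train[idx].mean())
```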

Champion & Top Contenders

In terms of predicting the overall winner, KNN and SVM have similar results. They both predict the same final game of Kentucky vs. North Carolina, though they differ on the outcome: KNN picks North Carolina, while SVM picks Kentucky. Using custom matchups, the next strongest team appears to be Ohio State. Beyond Ohio State, Missouri seems strong as a 2 seed, with a potential Missouri vs. Michigan State matchup being very close (KNN has it as the closest matchup of the entire tournament, at 9.77e-4).

Weakest Top Seeds & Their Toughest Matchups

Duke has the lowest first round confidence of all the 1 and 2 seeds, with a value of 1.44 for SVM (it's the only one below 2). However, Kansas potentially faces the toughest second round matchup, with a confidence of .364 (SVM) against Purdue. Of the 1 and 2 seeds, Duke potentially faces a much tougher third round matchup against Baylor, with a confidence of only .031 (SVM), which is pretty close to even. Of the one seeds, the only one with a sub-1 SVM matchup before the Elite 8 is Syracuse, who has two: SVM gives Syracuse a .756 confidence against Kansas State, and a .460 confidence against Wisconsin.

Room For Improvement But Still A Success

Overall, while our models were not superior to simply picking all favorites, we were still pleased with the results: our accuracy fell only slightly short of the all-favorites baseline. One of the really interesting things about this approach is that it allows one to make predictions without having any actual knowledge of basketball. SVM essentially decides for itself which statistics are important (the learning aspect of machine learning), which is the ultimate strength of Machine Learning as opposed to other techniques.