College Hoops Team Similarity Scores: Mining History to Understand the Present

So much of our casual conversations about sports rely on comparisons.

“Jared Sullinger is the college version of Paul Millsap.”

“This year’s Missouri Tigers remind me a lot of 2009 Villanova.”

Oftentimes, we make these comparisons based on things that are unquantifiable: how athletic a player is, how “hard” a team plays. But given that I am a quantitative guy, I’m always trying to add rigor to the so-called eye test.

“Similarity Scores” Tell Us Which Teams Are Statistically Alike

Similarity scores are a concept introduced by Bill James for baseball. The idea is to turn a few of a team’s key stats into numbers that reflect how far above or below average the team is in each category, measured in standard deviations (statisticians call this a “Z-score”).

Then the same can be done for a set of historical teams, and the absolute differences in Z-scores between the current team and each historical team can be summed to create a similarity score for each pairing. The lower the resulting score, the more similar the two teams are.

For example, Creighton currently leads the nation with an effective field goal percentage of 58.5 percent. That is 3.1 standard deviations above the mean for all teams from 2004 through 2012. For any team that is below average in eFG%, the contribution to the similarity score on that dimension is going to be quite large, reflecting the fact that Creighton and that team are not similar.
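To make the mechanics concrete, here is a minimal sketch in Python of the Z-score-and-sum idea. It is illustrative only: the categories, the weights, and every number except Creighton’s 58.5 eFG% are placeholders rather than my actual inputs.

```python
import numpy as np

# Toy population of historical team stats (rows = teams, columns = categories).
# The real inputs are the Four Factors on offense and defense (and now the
# variance measures) for every team from 2004 through 2012.
rng = np.random.default_rng(0)
population = rng.normal(loc=[50.0, 20.0, 33.0], scale=[3.0, 2.0, 4.0], size=(500, 3))

# League-wide mean and standard deviation of each category.
mu, sigma = population.mean(axis=0), population.std(axis=0)

def z_scores(team_stats):
    """Standard deviations above/below the league average in each category."""
    return (np.asarray(team_stats) - mu) / sigma

def similarity(team_a, team_b, weights=None):
    """Weighted sum of absolute Z-score differences; lower = more alike."""
    w = np.ones_like(mu) if weights is None else np.asarray(weights)
    return float(np.sum(w * np.abs(z_scores(team_a) - z_scores(team_b))))

# Creighton's 58.5 eFG% sits roughly 3 SDs above this toy population's mean, so a
# below-average shooting team picks up a large penalty on that one dimension alone.
creighton = [58.5, 19.0, 30.0]           # only the eFG% figure is from the post
poor_shooting_team = [47.0, 19.0, 30.0]
print(similarity(creighton, poor_shooting_team))
```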

[Technical note: Since teams have now played roughly 75 percent of their schedules, we have a fairly good idea of who they are. One complication is that the historical comparison teams’ full-season numbers include the tougher games that tend to come late in the schedule, but the Strength of Schedule adjustments should account for that difference (2012 SOS figures are not as spread out as full-year SOS figures, since teams have not yet played as many of the very good and very bad opponents they will face by season’s end).]

A New Wrinkle: Using Variance Data

Previously, I have done these calculations using Dean Oliver’s Four Factors on offense and defense (Effective Field Goal Percentage, Turnover Percentage, Rebounding Percentage, and Free Throw Rate), with weighting similar to what can be found here. The wrinkle I am adding now is variance data.

When we think about team performance, we don’t just think about the mean level. Florida State and Kentucky both have top-ten defenses in the country in per-possession terms. Yet while Kentucky has been extremely consistent in holding opponents to around 0.8 points per possession, FSU has been all over the place, allowing under 0.75 PPP to five teams but over 1.00 PPP to seven. Clearly we would not think of the Wildcats and the Seminoles as having very similar defenses.
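To show what the consistency dimension captures, here is a small sketch using hypothetical game logs; the numbers are placeholders shaped like the pattern described above, not the actual Kentucky or Florida State results.

```python
import numpy as np

# Hypothetical game-by-game defensive efficiency (points allowed per possession).
kentucky_def_ppp = np.array([0.79, 0.81, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81])
fsu_def_ppp      = np.array([0.70, 1.02, 0.68, 0.98, 0.66, 1.01, 0.72, 0.95])

for name, games in [("Kentucky", kentucky_def_ppp), ("Florida St.", fsu_def_ppp)]:
    # Comparable average level of defense, very different game-to-game spread.
    print(f"{name}: mean = {games.mean():.3f} PPP allowed, std dev = {games.std(ddof=1):.3f}")
```

The standard deviation (or variance) of those game-by-game results then becomes another category that gets its own Z-score in the similarity calculation.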

To my knowledge, no one has ever used variances in similarity scores. I hope that the results presented below and in this series of posts will provide team comparisons that better quantify what we see with our eyes.

I thought I’d start off this series by looking at the top comparisons for four non-BCS darlings that are highly ranked this year.

Murray State Racers

(97th in TR’s Consistency Rankings)

Team                Year   Similarity Score   Record   NCAA Result
Murray St.          2010   0.55               31-5     Round of 32
East Tennessee St.  2004   0.76               27-6     First Round
Iona                2006   0.77               23-8     First Round
Kent St.            2008   0.82               28-7     First Round

Interestingly, the 2012 Racers are most similar to their 2010 counterparts. All four of the teams listed here played weak schedules, and very consistently beat those weak schedules. Murray State’s defining characteristic is its consistency: the Racers are two standard deviations more consistent than the mean both offensively and defensively. While their four best comps were all double-digit NCAA tournament seeds and bowed out fairly quickly, MSU will likely benefit from a better seed and will have a better chance to advance further.

Creighton Bluejays

(139th in TR’s Consistency Rankings)

Team        Year   Similarity Score   Record   NCAA Result
Utah St.    2009   0.52               30-5     First Round
Pacific     2005   0.59               27-4     Round of 32
Oregon      2008   0.72               18-14    First Round
St. Mary's  2010   0.76               28-6     Sweet 16

The Bluejays do two things far from the norm of most teams: shoot the ball better (+3 SDs) and fail to force turnovers (-2 SDs). Of these comparisons, I think the Pacific comp is the most interesting. That team was led by swingman Christian Maraker, who was Doug McDermott before there was a Doug McDermott. Pacific won a game in the tournament, and given Creighton’s defensive issues, that seems like a reasonable outcome for the 2012 Bluejays.

Harvard Crimson

(220th in TR’s Consistency Rankings)

Team             Year   Similarity Score   Record   NCAA Result
Louisville       2008   0.38               27-9     Elite Eight
Wake Forest      2009   0.42               24-7     First Round
Stanford         2004   0.46               30-2     Round of 32
Sam Houston St.  2008   0.47               23-8     DNQ

I am biased here, but I had to include Harvard. The Crimson play strong defense: their opponents’ eFG% and offensive rebounding rate are both almost two standard deviations below the norm. Their stats are almost identical to the 2008 Louisville squad, though Louisville played a harder schedule; even adjusting for that, the Cardinals are Harvard’s closest comp. Stanford went 30-2 and earned a 1-seed in 2004, but played in the worst major conference in recent memory and thus faced the weakest schedule of any 1-seed in the last decade. The Crimson would be lucky to duplicate Louisville’s Elite Eight run, but a second-round appearance like the Cardinal’s is a reasonable goal.

St. Mary’s Gaels

(193rd in TR’s Consistency Rankings)

Team         Year   Similarity Score   Record   NCAA Result
Wichita St.  2011   0.24               29-8     Won NIT
Murray St.   2004   0.30               28-6     First Round
Arizona      2011   0.35               30-8     Elite Eight
Florida      2011   0.38               29-8     Elite Eight

The Gaels appear similar to two of the Elite Eight teams from last year’s tournament. What some might not realize about St. Mary’s is that they have a strongly negative covariance between points scored and points allowed per possession: when they play well on offense, they also tend to play well on defense, and vice versa. This magnifies both their good and their bad performances, and it is a trait shared with both 2011 Arizona and 2011 Florida. The Gaels will not miss the Tournament as Wichita State did, so their comparisons might suggest that there is potential for a deep run.
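For anyone who wants to see what that covariance looks like mechanically, here is a tiny sketch with made-up game logs (not the Gaels’ actual numbers):

```python
import numpy as np

# Made-up game logs: good offensive nights coincide with good defensive nights
# (fewer points allowed), so the covariance between the two series is negative.
off_ppp = np.array([1.20, 0.95, 1.15, 0.92, 1.18, 0.98])  # points scored per possession
def_ppp = np.array([0.88, 1.05, 0.90, 1.08, 0.86, 1.02])  # points allowed per possession

cov = np.cov(off_ppp, def_ppp)[0, 1]
print(f"offense/defense covariance: {cov:.4f}")  # negative: good and bad nights arrive together
```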

If you enjoyed this article, be sure to subscribe to the blog via the form at the upper right! And don’t forget to come back next Thursday for the next Similarity Score post!