Machine Learning Predictions of Playoff Series

Updated: September 6, 2013 at 11:08 am by Josh W


In my previous experiments I used Machine Learning to try to predict a single game in the NHL. Machine Learning, as a reminder, is a subfield of Artificial Intelligence and is similar to statistics in that it is algorithmic modelling of data. In ML the computer learns from data and, when given new data, can make predictions about the outcome.

I used ML in hockey to try to predict a single game outcome. The results were not overwhelmingly strong. I was able to achieve ~60% accuracy on a single game, which is better than the baseline, but still not that high. I also looked at modifying PDO over the last n games to see if it helped with prediction, and it did not. Surprisingly, in a single game, traditional statistics such as goals against, goal differential, and location were more important to the results than advanced statistics.

I was curious about this 60% and started looking at the observed standings in the NHL: what part of them is due to talent and what part is a result of random chance? Using classical test theory we can compare the observed results to those of a theoretical league, and we see that 38% of the observed results in the standings were a result of random chance. Further analysis using the Monte Carlo method shows that the NHL is statistically similar to a league where the results are 24% skill and 76% random chance. This means there’s a theoretical limit for prediction of ~62% in the NHL.
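One way to read the ~62% figure, offered as a rough sketch and not necessarily the exact calculation from the earlier post: if 24% of games are decided by skill (the better team always wins those) and the other 76% are coin flips, then a predictor that always picks the better team is right in every skill game and in half of the luck games. The function names and the simulation setup below are illustrative assumptions.

```python
import random

def prediction_ceiling(skill_share):
    """Best possible accuracy if skill games always go to the better team
    and the remaining games are 50/50 coin flips."""
    luck_share = 1.0 - skill_share
    return skill_share + luck_share / 2.0

def simulate_ceiling(skill_share, n_games=200_000, seed=0):
    """Monte Carlo check: always pick the better team and count the hits."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_games):
        if rng.random() < skill_share:
            correct += 1          # skill decides: the better team wins
        elif rng.random() < 0.5:
            correct += 1          # luck decides: the better team wins half the time
    return correct / n_games

print(f"closed form: {prediction_ceiling(0.24):.2f}")   # closed form: 0.62
print(f"simulated:   {simulate_ceiling(0.24):.3f}")
```

The closed form is just 0.24 + 0.76 / 2 = 0.62; the simulation should land close to it, with a little sampling noise.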


If predicting a single game is quite difficult due to random chance, I would hypothesize that over a longer period you’re more likely to see the better team win. We can use this in Machine Learning: I would hypothesize that you will get higher prediction accuracy if you look at a playoff series rather than a single game. It won’t be perfect, as you won’t eliminate random chance. Leonard Mlodinow explains it best in “The Drunkard’s Walk”: a team that can beat another team 55% of the time will still lose a 7-game series to the weaker team about 4 times out of 10. And with a 55-45 edge, the shortest statistically significant World Series would be a best of 269 games. So in this experiment I will try to predict the winner of a best-of-seven playoff series.
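The "4 times out of 10" figure can be checked with a short calculation. This is a minimal sketch that assumes every game is independent and that the favourite wins each one with the same probability (so it ignores home ice, injuries and so on); the function name is made up for illustration.

```python
from math import comb

def series_win_probability(p, wins_needed=4):
    """Probability that a team winning each game with probability p
    takes a series that ends when one side reaches wins_needed wins."""
    q = 1.0 - p
    total = 0.0
    for losses in range(wins_needed):
        # The team wins the clinching game plus wins_needed - 1 of the
        # (wins_needed - 1 + losses) games played before it.
        ways = comb(wins_needed - 1 + losses, losses)
        total += ways * p**wins_needed * q**losses
    return total

p_series = series_win_probability(0.55)
print(f"favourite wins best-of-seven: {p_series:.3f}")      # 0.608
print(f"underdog wins best-of-seven:  {1 - p_series:.3f}")  # 0.392
```

So a 55-45 favourite still loses roughly 39% of best-of-seven series under these assumptions, which is where the "4 times out of 10" comes from.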


I want to use both traditional RTSS statistics and performance metrics, but this leads to the problem that performance metrics only go back to 2007-2008. With 15 playoff series in a single year and 6 years of playoffs, we have only 90 series to train, test and cross-validate on. This is clearly a small sample size so we have to take the results with a grain of salt. I’ve put all of my data in a single spreadsheet which you can view here. Some of these statistics are self-explanatory and others you might not have heard of. The features in each row are (season average, unless otherwise indicated):

  • Year: the year of the playoff series.
  • Team: the team name.
  • Win: did the team end up winning or losing?
  • Home?: Did the team have home-ice advantage? This was an important feature in predicting a single game.
  • Distance: Straight line distance between the two cities, used to try and represent the travel the teams had to partake in that series. 
  • Conference Standing: the ranking of the team in their conference that year.
  • Division Rating: This and the next three features come from the Z-Rating centre.  There is a lot of research in rankings as predictors in soccer and tennis. Division Rating is the BSWP average of the division.
  • Z-Rating: The actual z-rating of each team of the year, based on the Bradley-Terry System.
  • BSWP (Balanced Schedule Winning Percentage): the team’s expected win percentage in an equal number of games against every other team in the group.
  • Strength of Schedule: The Z-Rating divided by the success ratio (ratio of wins to losses).
  • Season Fenwick Close: to represent possession
  • Score-Adjusted Fenwick: to represent possession
  • Season Corsi: to represent possession
  • Shooting Percentage
  • Save Percentage
  • Season PDO
  • Cap Hit
  • 5v5 Goals For
  • 5v5 Goals Against
  • 5v5 Goal Differential
  • Win Percentage
  • Pythagorean Expected Win Percentage
  • Points earned that year
  • 5v5 Goals For/Against Ratio
  • Power Play %
  • Penalty Kill %
  • Special Teams Index: Sum of PP% and PK%
  • Days Rest: Number of days since last game.
  • Games Total Played: Number of games played in the playoffs so far.
  • Fenwick Last-7: The Fenwick average of each team over the last 7 games. As we saw recently, the season average doesn’t always reflect how well the current team is playing (e.g. trades, injuries, etc.).
  • Corsi Last-7
  • Goalie Year Sv%: Based on the average of the goalies used over the last 7 games.
  • Goalie GAA: calculated the same way
  • Goalie Ev Sv%: Calculated the same way
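A few of the features above are simple combinations of other columns. The sketch below shows how they are commonly computed; the function names are illustrative, the exponent of 2 in the Pythagorean formula is the classic default and only an assumption here, and the spreadsheet may use slightly different inputs (for example 5v5 versus all-situation goals).

```python
def pdo(shooting_pct, save_pct):
    """PDO: shooting percentage plus save percentage."""
    return shooting_pct + save_pct

def special_teams_index(pp_pct, pk_pct):
    """Special Teams Index: power play % plus penalty kill %."""
    return pp_pct + pk_pct

def pythagorean_win_pct(goals_for, goals_against, exponent=2.0):
    """Pythagorean expected win percentage.
    exponent=2.0 is an assumed default, not a value taken from the post."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)

# Made-up season totals, for illustration only.
print(pdo(8.0, 92.0))                            # 100.0
print(special_teams_index(20.0, 82.0))           # 102.0
print(round(pythagorean_win_pct(250, 200), 3))   # 0.61
```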

For both teams in a playoff series their data is represented as a vector, V1 and V2. The differentials of these vectors are calculated, i.e. V1′ = V1 – V2, V2′ = V2 – V1. These new vectors are combined and given a label for which team won the series, “Team1” or “Team2”. This data was then fed into a number of algorithms. I tried Decision Trees, Support Vector Machines, Neural Networks, Logistic Regression, NaiveBayes and meta classifiers (majority voting with SVM, NN and NB).
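As a rough illustration of the differential step, the sketch below builds one labelled row per series. It assumes "combined" means the two differential vectors are concatenated into a single row; that reading, the function name and the sample numbers are all assumptions and not taken from the spreadsheet.

```python
def series_row(v1, v2, team1_won):
    """Turn two teams' feature vectors into one labelled training row."""
    if len(v1) != len(v2):
        raise ValueError("both teams need the same list of features")
    v1_diff = [a - b for a, b in zip(v1, v2)]   # V1' = V1 - V2
    v2_diff = [b - a for a, b in zip(v1, v2)]   # V2' = V2 - V1
    label = "Team1" if team1_won else "Team2"
    return v1_diff + v2_diff, label

# Made-up values for three features (e.g. a possession %, PDO, conference standing).
team1 = [52.0, 101.0, 3.0]
team2 = [49.0, 99.5, 6.0]
features, label = series_row(team1, team2, team1_won=True)
print(features)   # [3.0, 1.5, -3.0, -3.0, -1.5, 3.0]
print(label)      # Team1
```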


The initial results from these classifiers are as follows:

Classifier             Accuracy
Baseline               50%
JRip                   51%
LibSVM                 53.33%
J48                    54%
k-NearestNeighbours    61.11%
NaiveBayes             64.6%
NeuralNetworks         68%
Logistic Regression    70%
SMO                    71.11%

Further tuning of the Support Vector Machine SMO returns an accuracy of 74.44%. All testing was done using 10-fold cross-validation. SMO, Logistic Regression, NN, and Voting are all significantly different from the baseline using a paired t-test (p<0.05). The tuned SMO is not significantly different from the untuned version or from NeuralNetworks.
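For readers unfamiliar with 10-fold cross-validation, here is a minimal stdlib-only sketch of the procedure. It is not the Weka setup used for the numbers above: the folds here are a plain shuffle with no stratification, the classifier is a stand-in that always predicts the most common training label, and the data is made up.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, return overall accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)

# Stand-in classifier: always predict the most common training label.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

# Made-up data: 90 placeholder rows, 60 labelled "Team1" and 30 "Team2".
rows = [[0.0] for _ in range(90)]
labels = ["Team1"] * 60 + ["Team2"] * 30
print(cross_validate(rows, labels, train_majority, predict_majority))
```

With 90 rows and 10 folds, each model is trained on 81 series and tested on the remaining 9, so every series is used for testing exactly once.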

Using CfsSubsetEval we can determine which features contribute most to the outcome of the classifier. In this data the most useful features are the Z-Ratings, the Pythagorean Expected Win % and the Fenwick Last-7.
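The idea behind correlation-based feature selection is to prefer subsets of features that correlate strongly with the outcome but weakly with each other. The sketch below scores a subset with the usual CFS merit formula, using plain Pearson correlation and a 1/0 encoding of the winner; both choices are assumptions made for illustration, and Weka's own implementation may measure correlation differently.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def cfs_merit(columns, target):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean
    |feature-target| correlation and r_ff the mean |feature-feature| correlation."""
    k = len(columns)
    r_cf = sum(abs(pearson(col, target)) for col in columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Toy columns: one informative feature and one unrelated to the target.
target = [1, 2, 3, 4]
informative = [1, 2, 3, 4]
unrelated = [1, -1, -1, 1]
print(round(cfs_merit([informative], target), 3))             # 1.0
print(round(cfs_merit([informative, unrelated], target), 3))  # 0.707
```

Adding the unrelated column lowers the merit, which is why a subset evaluator of this kind tends to keep a small set of strong, non-redundant features.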

Discussion & Conclusion

At first look, the results of the classifiers have improved significantly over predicting a single game. If we recall, those accuracies were in the mid-50s, whereas when predicting a playoff series we are able to get an accuracy in the high 60s to low 70s. Our best classifier is even able to get an accuracy of 74.44%. This accuracy is almost the same when we leave out 33% of the data and train on only 66% of it. I feel confident that this classifier is a huge improvement. Even the simple classifier where the higher ranked team gets selected is only correct on 56% of this data.

It was encouraging to see that the most important features were advanced statistics rather than the RTSS. I was initially surprised to see how little role they played in predicting a single game, but as the sample gets larger, so does the importance of the advanced metrics. I was surprised to see Z-Ratings be more important than Fenwick over the last 7. I was also surprised to see Pythagorean Expected Win % even in the top three features.

For future work I definitely think it is worth playing with Fenwick Last-n to see what the best number of previous games to go back is. I picked seven to look back at the previous playoff series, but accuracy may improve if we increase (or decrease) that number.
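Experimenting with Fenwick Last-n mostly means recomputing one feature for different window sizes. A minimal sketch, assuming the feature is a simple mean of per-game Fenwick percentages (the spreadsheet may instead pool shot counts); the function name and the numbers are made up.

```python
def last_n_average(per_game_values, n=7):
    """Mean of the most recent n values (or of all of them if fewer exist)."""
    if n <= 0:
        raise ValueError("n must be positive")
    recent = per_game_values[-n:]
    if not recent:
        raise ValueError("no games to average")
    return sum(recent) / len(recent)

# Made-up per-game Fenwick percentages, oldest game first.
fenwick_by_game = [48.0, 52.0, 55.0, 50.0, 47.0, 53.0, 51.0, 54.0, 49.0, 56.0]
for n in (3, 5, 7, 10):
    print(n, round(last_n_average(fenwick_by_game, n), 2))
```

Each window size gives a candidate feature column, which can then be compared by re-running the same cross-validation.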

I wanted to come up with a “Roster Strength” over the last n games as well. The problem was I could not find a single value for every player in the league (Corsi Rel, Avg TOI, GVT etc.) that was easily scrape-able. I could write a script to go through BehindTheNet or Stats.HockeyAnalysis but that is going to take a long time. Similar to predicting a single game, I think it would be worth running a Monte Carlo simulation to see what the theoretical limit in predicting a single series is. Also, getting more data would be nice, but given we only have 15 series a year it is not going to happen in the foreseeable future.

But then again, as Dave Nonis says, 48 games is a large enough sample size.


I was asked on Twitter to try Random Forests, which I did, and it returned an accuracy of 64.44%.