Introduction
I spent this last week at the European Conference on Machine Learning, which hosted a workshop on Sports Analytics. I happened to be presenting some of my previous #fancystats work on machine learning for predicting single games. The more I look into this, the more I realize just how large a role random chance (or “luck”) plays in determining the outcome of an NHL game. But when I talk to people about this, they just cannot wrap their heads around the idea.
Often they will jump to mantras like “the best athletes make their own luck,” but it is precisely because of the elite nature of the game, and how close the teams are in talent, that the little things outside a player’s control often decide who wins a game. Things such as pucks bouncing off players into the net, or the puck going over the wrong side of the glass for a penalty. Luck also often comes in the form of a high shooting percentage sustained over an extended period. With goals being low-frequency events and games often decided by a single goal, it becomes easier to see how luck plays a large role in the results.
This luck, combined with parity, imposes an upper bound on how accurately machine learning can predict games in a sport, and that bound is what I want to explore.
Background
In my first experiment I looked at predicting success in NHL games, that is, who will win or lose. I trained on 72% of the 2012 season’s games using 14 features per team, including advanced statistics such as PDO, Fenwick Close, and 5v5 goals for/against, as well as traditional statistics such as goals for, goals against, goal differential, and location. Regardless of what I tried, I was unable to achieve higher than ~60% accuracy in predicting NHL games. Further analysis showed that the traditional statistics helped the prediction more than anything else, specifically location, cumulative goals against, and cumulative goal differential.
Then I looked at the observed standard deviation of win percentage in the NHL since the last lockout, which turns out to be 0.09. I then ran a Monte Carlo simulation, varying the mix of luck and skill required to win a game, until I found a simulated league as similar to the NHL as possible. The observed NHL turns out to resemble a league where 24% of results are determined by skill and 76% by random chance. This implies a theoretical upper limit on prediction accuracy of approximately 62% for the NHL.
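A minimal sketch of this kind of Monte Carlo simulation is below (Python; the random pairings, uniform skill ratings, and parameter defaults are my illustrative assumptions, not the exact original setup):

```python
import random
import statistics

def simulate_league(n_teams=30, n_games=82, skill_weight=0.24, seed=0):
    """Simulate one season where each game is decided by skill with
    probability skill_weight and by a coin flip otherwise. Returns the
    standard deviation of team win percentages, which can be compared
    against a league's observed value (e.g. 0.09 for the NHL)."""
    rng = random.Random(seed)
    skills = [rng.random() for _ in range(n_teams)]  # latent team strengths
    wins = [0] * n_teams
    games = [0] * n_teams
    for _ in range(n_games * n_teams // 2):
        a, b = rng.sample(range(n_teams), 2)
        if rng.random() < skill_weight:
            winner = a if skills[a] > skills[b] else b  # skill decides
        else:
            winner = a if rng.random() < 0.5 else b     # pure luck
        wins[winner] += 1
        games[a] += 1
        games[b] += 1
    return statistics.pstdev(w / g for w, g in zip(wins, games))
```

Sweeping `skill_weight` until the simulated spread matches the observed 0.09 is the basic idea; a pure-luck league lands near 0.055 for an 82-game season, while a pure-skill league spreads much wider.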
Experiment
My hypothesis in this experiment is that I can build a machine-learning classifier for any hockey league, using those three traditional statistics, that predicts who will win a game with an accuracy better than the baseline and near the upper limit.
The first step is to look at a number of different leagues.
I wanted to capture different talent levels, ages, and areas of the world, so I went with the National Hockey League (NHL), American Hockey League (AHL), East Coast Hockey League (ECHL), Western Hockey League (WHL), Ontario Hockey League (OHL), Quebec Major Junior Hockey League (QMJHL), British Columbia Hockey League (BCHL), Swedish Hockey League (SHL), Kontinental Hockey League (KHL), Czech Hockey League (ELH), and Australian Ice Hockey League (AIHL).
For each league I first looked at its schedule to calculate cumulative statistics on location, goals for, goals against, and goal differential. I then formatted this into files readable by Weka for machine learning, and experimented with different algorithms to see the best classification rate I could achieve, using 10-fold cross-validation on a single season of data per league. I also look at the percentage of games won by the home team each season and use this as a simple baseline classifier to compare against. In almost all leagues, across many seasons, this regresses to about 55%.
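The cumulative-feature construction can be sketched roughly as follows (a hypothetical Python version; the schedule tuple format and field names are my assumptions, and the real pipeline wrote Weka-readable files instead):

```python
def cumulative_features(schedule):
    """Walk a season's schedule in order and, for each game, record each
    team's cumulative goals for/against/differential *before* that game,
    along with the outcome. schedule: ordered list of
    (home, away, home_goals, away_goals) tuples."""
    gf, ga = {}, {}
    rows = []
    for home, away, hg, ag in schedule:
        for t in (home, away):
            gf.setdefault(t, 0)
            ga.setdefault(t, 0)
        rows.append({
            "home_gf": gf[home], "home_ga": ga[home],
            "home_diff": gf[home] - ga[home],
            "away_gf": gf[away], "away_ga": ga[away],
            "away_diff": gf[away] - ga[away],
            "home_win": hg > ag,
        })
        gf[home] += hg; ga[home] += ag  # update *after* recording
        gf[away] += ag; ga[away] += hg
    return rows

def home_win_baseline(rows):
    """Accuracy of the naive 'home team always wins' classifier."""
    return sum(r["home_win"] for r in rows) / len(rows)
```

Updating the tallies only after emitting the row matters: each game's features must reflect what was knowable before puck drop, or the classifier leaks the result it is trying to predict.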
Since a number of classification algorithms were being surveyed, no tuning was done. Initial feedback suggests the reported classifier numbers could be increased by 1-2% by tuning the SMO, neural network, or logistic regression classifiers.
I then look at the parity of each league, using the observed win percentage of every team as far back as reasonably possible. Some leagues only go back a few years, such as the KHL, which has only existed since 2008; the AIHL has the same issue. For leagues that have existed for a while (WHL, OHL, QMJHL) I tried to go back to around 1996-1997 (200-300 observed team seasons). This gave me the observed win percentage of every team in that period, from which I could calculate the standard deviation. I then ran a Monte Carlo simulation to calculate the approximate upper limit for machine learning, so I could compare my classifier against that limit.
There are two things we have to acknowledge.
The first is that when looking at the number of team seasons used to calculate the standard deviation of win percentages, we have at most 200-300. This leads to a small-sample-size issue.
The second is that because we have so few observations we can’t confirm the distribution, so in each league we have to assume it is binomial (it could be a beta distribution for all we know; Nick Emptage will be exploring this further on PuckPrediction).
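The binomial assumption also gives a quick sanity check: if every game were a fair coin flip, the spread of win percentages over an n-game season would be sqrt(p(1-p)/n). A small sketch of that check (Python; my own illustration, not part of the original analysis):

```python
import math

def binomial_win_stdev(n_games, p=0.5):
    """Standard deviation of a team's win percentage if every game
    were an independent coin flip won with probability p:
    sqrt(p * (1 - p) / n_games)."""
    return math.sqrt(p * (1 - p) / n_games)

# An 82-game all-luck season spreads teams by about 0.055; the NHL's
# observed 0.09 is wider than that, so some skill signal is present.
luck_stdev = binomial_win_stdev(82)
```

Any league whose observed spread sits well above this coin-flip floor has room for a classifier to beat 50%.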
Results
| | NHL | AHL | ECHL | WHL | OHL | QMJHL |
|---|---|---|---|---|---|---|
| Trained season stdev | 0.109 | 0.07 | 0.102 | 0.15 | 0.134 | 0.166 |
| Observed win% stdev | 0.09 | 0.086 | 0.11 | 0.141 | 0.143 | 0.146 |
| # teams in MC | 30 | 30 | 30 | 22 | 20 | 18 |
| Team seasons in observed data | 240 | 231 | 238 | 324 | 298 | 283 |
| Games/team in MC | 82 | 76 | 76 | 72 | 68 | 68 |
| Upper bound | 62% | 60.50% | 65% | 72% | 71.50% | 72.50% |
| # games classifier trained on | 512 | 1140 | 720 | 792 | 680 | 612 |
| Classifier accuracy | 59.80% | 52.58% | 58.70% | 63.07% | 63.60% | 65.52% |
| Limit differential | -2.20% | -7.92% | -6.30% | -8.93% | -7.90% | -6.98% |
| Home win % | 56.80% | 52.50% | 55.90% | 55.50% | 55.50% | 53.80% |
| Baseline differential | 3.00% | 0.08% | 2.80% | 7.57% | 8.10% | 11.72% |
| Best classifier | Voting | Simple Logistic | Simple Logistic | Logistic | Logistic w/ Bagging | Naive Bayes |
| | BCHL | SHL | KHL | ELH | AIHL |
|---|---|---|---|---|---|
| Trained season stdev | 0.178 | 0.132 | 0.143 | 0.089 | 0.15 |
| Observed win% stdev | 0.155 | 0.115 | 0.137 | 0.119 | 0.191 |
| # teams in MC | 16 | 12 | 26 | 14 | 9 |
| Team seasons in observed data | 165 | 204 | 120 | 238 | 47 |
| Games/team in MC | 56 | 55 | 52 | 52 | 24 |
| Upper bound | 73% | 69% | 70.50% | 66% | 76.50% |
| # games classifier trained on | 480 | 330 | 676 | 364 | 108 |
| Classifier accuracy | 66.88% | 60.61% | 61.02% | 61.53% | 64.81% |
| Limit differential | -6.13% | -8.39% | -9.48% | -4.47% | -11.69% |
| Home win % | 52.90% | 52% | 56% | 61.50% | 48% |
| Baseline differential | 13.98% | 8.61% | 5.02% | 0.03% | 16.81% |
| Best classifier | SMO | Neural Network | Voting | SMO | Simple Logistic |
I graphed the year-to-year parity levels of each of the leagues. The horizontal axis is the season, from the current 2013-2014 back as far as 1996-1997. The vertical axis is the standard deviation of win percentages that year. A win in regulation, OT, or the shootout all counted as a win, and likewise for losses.
Discussion
I find this year-to-year graph quite interesting. First of all, a lot of leagues are not very steady in their parity levels from year to year. This could be because the less parity a league has, the more volatile it becomes. The other reason is that each of those points comes from only 12-30 team seasons, which gives us a small sample size.
It does still show us which leagues have more parity than others. Structural changes can cause a league’s parity to shift direction, such as the introduction of the salary cap in the NHL (which is why I didn’t look past 2005 for this league). Other factors I am not aware of might affect other leagues and could explain the ELH’s change in direction toward more parity. Also interesting is that all leagues seem to progress from around the same point in 2004 (the same year as the introduction of the NHL salary cap; I am not sure if there is causation here).
I was surprised by the difference in parity between the KHL and the NHL, the two leagues generally considered the best in the world. It makes sense when you look at how they are financed and how spending on talent varies between teams. It was interesting to see that the AHL has more parity than the NHL; my assumption is that AHL talent is more evenly matched, players just below NHL level or younger players still developing. You don’t see the superstars the NHL has. Also interesting is how stable the OHL is versus the QMJHL, but that could be a small-sample-size artifact.
Another thing I was surprised to see, although it makes sense on inspection, is that talent does not equal parity. This goes hand in hand with the NHL vs. KHL comparison. The KHL is the league closest to the NHL (at least in terms of NHL equivalency points, assuming the NHL is the top league), yet leagues with much lower talent have more parity than the KHL (e.g., the ECHL). So while parity != talent, since parity is about the difference between your best and worst teams, the less parity your league has, the easier its games are to predict; the more parity, the closer the games move toward a coin flip.
Interestingly, and again it seems obvious in hindsight, the amount of parity in the season you train on matters. If your training data has more parity than the long-run observed level, classifier accuracy moves closer to the baseline home win % (and never seems to drop below it).
This suggests we should be training on multiple seasons of data. I did this for the NHL and the results are shown below. Also interesting: the best classifiers never seem to perform worse than home-ice advantage, which in most leagues is around 55-56% (except in Australia, where it’s 48-49%; I guess everything is backwards there).
I am sure this has been explored, but it’s interesting that despite parity the home team always has an advantage. It can come from a number of factors: less travel, more rest, sleeping in their own beds, the home crowd pumping them up, and officiating bias. I will leave this for someone else to explore; I am curious how games at a neutral site would affect classifier rates (but we don’t have many of those in the NHL).
| Train → Test (accuracy %) | 2005 → 2006 | 2005-06 → 2007 | 2005-07 → 2008 | 2005-08 → 2009 | 2005-09 → 2010 | 2005-10 → 2011 | 2005-11 → 2012 |
|---|---|---|---|---|---|---|---|
| NB | 58.1774 | 54.2718 | 57.6892 | 56.0618 | 52.8413 | 55.8991 | 56.0501 |
| J48 | 58.1364 | 52.5631 | 58.4215 | 57.8519 | 55.4109 | 55.2075 | 56.3282 |
| RF | 52.7258 | 51.2205 | 51.874 | 82.0993 | 52.319 | 53.214 | 49.235 |
| NN | 58.2994 | 52.441 | 56.55 | 52.0342 | 53.4988 | 52.0749 | 51.3213 |
| SVM | 55.0854 | 53.7022 | 55.8991 | 56.0618 | 51.8308 | 55.8991 | 56.8846 |
We can see that some of these algorithms are better than others, and we can also see some concept drift, suggesting that training on too many seasons too far back isn’t very helpful to our predictions. Some work with logistic regression suggests it still achieves about 58% accuracy when trained on the 2005-2011 data. Redoing this experiment with a league with less parity (and thus a higher ceiling) would likely exaggerate the differences in year-to-year accuracies.
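The multi-season experiment above follows an expanding-window scheme: train on 2005 through season k, then test on season k+1. A generic sketch of that loop (pure Python; the data format and the fit/score interface are hypothetical stand-ins for the Weka workflow):

```python
def expanding_window_eval(seasons, fit, score):
    """seasons: ordered list of (label, games) pairs. For each i, fit a
    model on all games up through season i and score it on season i+1,
    mirroring the Train/Test columns of the results table above."""
    results = {}
    train_games = []
    for i in range(len(seasons) - 1):
        train_games.extend(seasons[i][1])     # grow the training window
        test_label, test_games = seasons[i + 1]
        model = fit(train_games)              # refit on all seasons so far
        results[test_label] = score(model, test_games)
    return results
```

Because each test season is strictly later than everything it was trained on, a drop in accuracy as the window grows is exactly the concept-drift signal discussed above.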
So going back to our hypothesis: I feel confident that yes, there is an upper bound, a glass ceiling, on how well our classifiers can do. It is correlated with the parity of the league, and also with the number of teams and the number of games played in a year. There is an r-squared of 0.9269 on this data between the parity of the league over all observed seasons and the upper limit for prediction.
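That correlation can be checked directly from the two results tables (a small pure-Python sketch; the parity and upper-bound values are copied from the tables above):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs:
    r^2 = sxy^2 / (sxx * syy), using deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Observed win% stdev (parity) and Monte Carlo upper bound, per league,
# in table order: NHL, AHL, ECHL, WHL, OHL, QMJHL, BCHL, SHL, KHL, ELH, AIHL
parity = [0.09, 0.086, 0.11, 0.141, 0.143, 0.146,
          0.155, 0.115, 0.137, 0.119, 0.191]
upper = [0.62, 0.605, 0.65, 0.72, 0.715, 0.725,
         0.73, 0.69, 0.705, 0.66, 0.765]
```

Running `r_squared(parity, upper)` on these values reproduces the 0.9269 figure: the wider a league’s spread of win percentages, the higher its prediction ceiling.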
I feel strongly that the Monte Carlo method can calculate the approximate limit, and it seems fairly intuitive: as your league approaches full parity, your teams become more even and each game moves closer to a coin flip. I feel the second part of the hypothesis has also been addressed: yes, we are able to create classifiers for hockey leagues using those three features, and they perform better than the baseline. Performance is dependent on the parity of the data you train on, so training on multiple years is key. I also tried adding new features, such as cumulative goals for, and that did not statistically change the results.
I was also surprised by the accuracy I was getting from logistic regression and simple logistic regression, when my initial experiments had shown the success of SMO and neural networks, algorithms that are typically strong on noisy data such as this. Tuning them might boost overall accuracy, but I doubt the difference would be statistically significant.
Conclusion
In this article I looked at the glass ceiling on prediction accuracy across a number of hockey leagues and then tried to create classifiers that reach it. I feel both of these were achieved. It is interesting to see that the amount of parity in a league affects the overall prediction level, and that talent in a league does not equal parity. The question then becomes: would you rather watch a talented league full of parity, where anyone can win? Or a talented league with little parity, where your team either dominates or gets dominated?
As always, if you want to discuss any of this, feel free to drop me a line.