In the immortal words of Fox Mulder, “Nobody likes a math geek, Scully.”
Especially when they predict that the St. Louis Blues will win the Cup this year. Ugh.
And yet, there we are. So how exactly did we get to this horrible, horrible outcome? Let’s open up this particular X-File and find out…
There are quite a few models out there that aim to predict the outcome of playoff series, some at impossibly high rates (I’m looking at you, SAP). Many of them rely on a regression analysis between some combination of statistics and the outcomes of historical playoff series. As the Contrarian Goaltender points out in an excellent post on this topic, the result is often that you wind up overfitting a curve to match the results, without much thought to how the underlying statistics actually contribute to winning a series in the real world.
In order to avoid this, I went back to basics.
One very well understood idea in the hockey analytics community is that puck possession drives results, and that shot attempts are a good proxy for puck possession. So I took a look at the relationship between different measures of shot attempts and the likelihood of outscoring the opposition, because ultimately, that is what wins hockey games.
The following charts are based on 9,931 regular season NHL games from the 2007-08 season through the 2014-15 season. They look at even strength play only, and the statistics used are score-adjusted, rather than raw counts. All data was pulled from WAR-On-Ice.
I took the shot attempt % for each team in those 9,931 games and binned them in 1% increments. Then within each bin I counted up how many teams had more than half the goals (GF% > 50%), and expressed that as a percentage:
For simplicity, I did not count games where the score was tied, and due to small sample sizes in the tails of the distribution, I also discounted all games where shot attempt % was less than 30% or greater than 71%. That’s still 7,558 games worth of data.
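The binning step described above can be sketched as follows. This uses a tiny synthetic sample in place of the real WAR-On-Ice data, and the record layout is my own invention; each record is (shot attempt %, GF%) for one team in one game.

```python
# Minimal sketch of the binning, using made-up sample data.
# Each record: (score-adjusted shot attempt %, goals-for %) for one team-game.
from collections import defaultdict

games = [
    (45.0, 40.0), (45.4, 55.0), (52.1, 60.0),
    (52.7, 45.0), (52.9, 66.7), (61.3, 75.0),
]

# Drop tied games and trim the sparse tails, as in the text.
kept = [(ca, gf) for ca, gf in games
        if gf != 50.0 and 30.0 <= ca <= 71.0]

# Bin shot attempt % in 1% increments, then compute the share of
# team-games in each bin that outscored the opposition (GF% > 50).
bins = defaultdict(list)
for ca, gf in kept:
    bins[round(ca)].append(gf > 50.0)

outscore_rate = {b: sum(flags) / len(flags) for b, flags in bins.items()}
```

The real version would run this over all 9,931 team-game records, but the shape of the calculation is the same.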
So I think we can all agree that shot attempts (Corsi) have something to do with winning hockey games.
Here’s the same process for unblocked shot attempts (Fenwick), which has a slightly better correlation to outscoring the opposition:
And if we look at just straight shots on goal, the improvement is insignificant:
Clearly, we’ve hit diminishing returns. There is much more value in sticking with the shot attempt metrics than with shots on goal, because shot attempts reach half-decent sample sizes in a smaller number of games.
I also thought it would be useful to look at Scoring Chances (as defined and tabulated by WAR-On-Ice) given some indications that this might be a better predictor of future goals. That may be the case, but in an individual game, at the team level, Scoring Chances are not as well correlated with outscoring the opposition as the shot attempt based stats:
At least not at the tail ends.
Ok, so we know that shot-based metrics are correlated to outscoring your opponents, now what?
Predicting Shot Attempt %
Well, if we could predict the shot attempt differential for a given game, we would have a pretty good idea of which team is likely to score the most goals as well.
So I set about doing that and the closest I could get was to predict about 27% of the variation in shot attempts by using the shot attempts for and against of each team over the previous 20 games. For each team I averaged shot attempts for with the opponents’ shot attempts against to come up with predicted shot attempts, then combined the results for each team to determine the shot attempts %.
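That blending step might look something like the sketch below. The function and parameter names are my own, not from the original model; the inputs are each team's 20-game averages of shot attempts for and against per game.

```python
# Sketch of the blending described above (names are illustrative only).
def predicted_attempt_pct(home_cf, home_ca, away_cf, away_ca):
    """Predict the home team's shot attempt % by averaging each team's
    attempts-for with the opponent's attempts-against."""
    home_pred = (home_cf + away_ca) / 2.0   # expected home attempts
    away_pred = (away_cf + home_ca) / 2.0   # expected away attempts
    return 100.0 * home_pred / (home_pred + away_pred)
```

For example, a team averaging 55 attempts for against an opponent conceding 55 per game, while the opponent averages 50 for against a defence conceding 50, comes out to a predicted share of about 52.4%.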
I tried factoring in days rest, amount of travel, even time zone changes but the only other significant variable was the difference in days rest. And even then the improvement in the correlation was tiny. Given that the point of this whole thing is to try to predict playoff series, the amount of rest is almost always the same for both teams anyway.
So while accounting for 27% of the variation in shot attempt % on a game-by-game basis is actually not bad, it’s not good enough to reliably make predictions with.
Or is it?
While I thought I had hit a dead end, it dawned on me that maybe I should skip the middle man. Does it really matter if the predicted shot attempts match with the actual shot attempts? The point is to predict outcomes, so why not cut out the middle man, er, intermediary stat? Let’s look at how the predicted shot attempt % correlated with goals for %:
Not bad. That’s almost as good as the correlation between actual score-adjusted shot attempts and goals for. I should note here that I did the same for unblocked shot attempts and the correlation, although still good, was lower than for overall shot attempts.
Again, a reminder that this correlation holds over 5,000 regular season NHL games.
This is a good place to talk about those games I discounted. Clearly it is fair to discount the games at each tail end, as the small samples make it easy to skew results. But is it fair to discount games where the score is tied during even strength play? It definitely makes things easier to discount those games, but does it affect the results?
Instinctively, I would say no. Especially when we are thinking about post-season play where the game will continue if the score is tied. But this probably needs a little more exploration to be sure that it is not skewing the end result.
Predicting Playoff Series
Now that we have a way to reliably predict the probability of outscoring the opponent based on the expected shot attempt differential, let’s apply this to a playoff series.
The first thing we need to do is look at the difference in the relationship between shot attempts and likelihood of outscoring the opponent at home versus on the road. A lot is said about the importance of home ice advantage, and the data certainly bears it out:
The two lines are not quite parallel, but if we look at the 50% mark, the difference between them is a full 10 percentage points.
What this tells us is that at a given shot attempt differential, your chance of being in the lead is 10 percentage points higher at home than on the road. That is a huge advantage, and one that is certainly important to consider when predicting the outcome of a playoff series.
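Purely as an illustration of how that home/road gap might enter a model, consider the sketch below. The slope and the linear form are assumed placeholders, not values fitted from the charts; only the 10-point gap at the 50% mark comes from the data above.

```python
# Illustrative only: a hypothetical linear fit of the home/road curves.
# The slope is an assumed placeholder; the 10-point home/road gap at
# CF% = 50 is the one piece taken from the text.
def outscore_prob(cf_pct, home):
    slope = 2.0                      # assumed: points of GF% > 50 per 1% of CF%
    base = 50.0 + slope * (cf_pct - 50.0)
    adj = 5.0 if home else -5.0      # half the 10-point gap on each side
    return min(100.0, max(0.0, base + adj))
```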
Now that we’ve broken it down to home and away results, we have a simple model that takes shot attempts over the last 20 games* of the season and gives us the probability of winning a single game. But how do we translate that to an entire series?
Well, if you look at the various permutations of wins and losses in a seven game series, there are 70 different sequences that can occur (35 ending with each team winning). I’m not going to get into the math, but suffice it to say, you can apply the probability of the home/road team winning each game and you get something like this:
The amazing thing here is that even if you had a team that only had a 33% chance of winning a given game, they would still be expected to win a seven game series once every six times. And if you recall the relationship between shot attempts and chance of outscoring the opposition above, that translates to a team putting up about 42% of the shot attempts. So yes, even the Edmonton Oilers could win a playoff series. If they made the playoffs.
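The series arithmetic can be sketched without enumerating the sequences by hand, by recursing over game outcomes. The 2-2-1-1-1 home pattern here is an assumption matching the standard NHL playoff format, and the function names are mine.

```python
# Sketch of the series math: recurse over game outcomes until one side
# reaches four wins. p_home/p_road are the single-game win probabilities
# for the team of interest; games 1, 2, 5, and 7 are assumed at home
# (the standard 2-2-1-1-1 NHL format).
HOME_PATTERN = (True, True, False, False, True, False, True)  # games 1-7

def series_win_prob(p_home, p_road, wins=0, losses=0):
    """Probability of winning four games before losing four."""
    if wins == 4:
        return 1.0
    if losses == 4:
        return 0.0
    p = p_home if HOME_PATTERN[wins + losses] else p_road
    return (p * series_win_prob(p_home, p_road, wins + 1, losses)
            + (1 - p) * series_win_prob(p_home, p_road, wins, losses + 1))
```

With p_home = p_road = 1/3, this comes out to roughly 0.17, matching the "once every six times" figure above for the hypothetical 33% team.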
But back to the question at hand, we now have a model to predict shot attempt differential, which leads to chance of outscoring the opponent at even strength. Because this is a simple model, we’re going to leave out the impact of power plays. Yes, they will undoubtedly have an impact on the outcome of more than one series, but if you have a reliable way of predicting which ones, then you should probably stop reading this and head over to Bodog instead…
Instead, we’re going to assume that in terms of predictive probabilities, power plays will even out in proportion to the run of play, i.e. shot attempt differential.
So the chance of outscoring your opponent at even strength essentially becomes the probability of winning a given game. This can be adjusted for home ice advantage and then used to give us the probability of winning a seven game series.
How effective is this approach?
Applying it to last year’s playoffs results in 67% accuracy. Sure, that’s not at SAP level, but then I didn’t feed a multi-million dollar piece of software over 40 statistics either.
So now that all that is out of the way, let’s see what happens when we set this thing loose on the 2014-15 NHL playoffs:
The percentages are based on the probability of the home team winning each game and series. So the model predicts four relatively clear series wins for Ottawa, Pittsburgh, St. Louis and Vancouver.
Washington is a mild favourite to beat the Islanders, and the other three series are virtual toss-ups.
Moving the winners on to the second round gives us:
The Canucks run into a buzz saw. Ouch.
On to the Conference Finals:
Beware the Penguins?
Nope. In a close-fought final, the Blues take it by a nose.
Ugh. Now I understand why nobody likes a math geek.
Turns out I was a little hasty in filling out my NHL.com bracket, so the image at the top originally showed the Lightning in the Final. I’ve since updated it to match the model, which has the Pens in there against the Blues.
Also, I’ve now had time to go back and test the model against every playoff year going back to 2007-08 and it is sitting at 68 out of 105, or just a shade under 65%. Not too bad but definitely not at SAP level.