Predicting next season's top 5v5 scorers using an artificial neural network

Updated: January 11, 2018 at 1:39 am by Sam Mercier

Predicting future scoring
rates is a hot topic in the analytics community, and with good reason: the
ability to score goals literally explains half of the success or failure of a
team.

Just a few weeks ago, Travis Yost investigated the statistics that best predict future scoring rates, and
showed that the scoring rate and relative Corsi for during the previous season
are equally good predictors. Another study worth mentioning comes from Eric Tulsky, who showed that not only the scoring rate of the previous season, but also that of the seasons before it, should be considered when predicting future scoring rates. Going back more than one season helps distinguish skill from luck and estimate the true offensive quality of a player.

From what I have seen, most
models developed to predict scoring rates are relatively simple and only use one
or two statistics for the prediction, the most important one
being the scoring rate of that same player the previous season. Yet, as we have previously discussed, hundreds of new statistics have been made available over
the last decade to describe the 5v5 performance of NHL forwards. Each of
these new statistics can contain a tiny bit of information regarding the
offensive quality of a player. If a modeling approach is able to find this
information, we will be able to predict with a greater accuracy future scoring
rates.

The increasing number of
statistics also provides the opportunity to use modeling approaches that go
beyond fitting a line or a curve, and it is exactly what we will do here using
a modeling approach called an artificial neural network. But first, let’s start
with a simple approach.

The simple model

We will try to predict the
number of 5v5 points per game played (ESPTS/GP) that NHL forwards will obtain
next season. First, let’s do so using a simple linear model. If we take the
ESPTS/GP obtained by forwards from 2007-2008 to 2015-2016 (with a minimum of 200 min
of 5v5 TOI) and we regress their ESPTS/GP with the ESPTS/GP that
they have obtained the previous season, we obtain the following model: ESPTS/GP_next_season = 0.67 × ESPTS/GP_previous_season + 0.11. To give us an idea of the accuracy of this model, here is a plot
of the ESPTS/GP predicted by the model compared to the actual ESPTS/GP that
forwards have obtained between 2007-2008 and 2015-2016:

[NN_Fig1: ESPTS/GP predicted by the simple linear model vs. actual ESPTS/GP, 2007-2008 to 2015-2016]
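For readers who want to try this at home, the single-predictor fit can be reproduced with a few lines of NumPy. The data below is a synthetic stand-in (the real forward-season data comes from corsica.hockey and is not included here), so the fitted coefficients will only roughly echo the 0.67 and 0.11 reported above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the forward-season data (the real ESPTS/GP values
# come from corsica.hockey and are not reproduced here)
prev = rng.uniform(0.2, 1.0, 500)                        # ESPTS/GP, previous season
nxt = 0.67 * prev + 0.11 + rng.normal(0.0, 0.11, 500)    # next season, with noise

# Ordinary least squares fit of ESPTS/GP_next = a * ESPTS/GP_prev + b
A = np.column_stack([prev, np.ones_like(prev)])
(a, b), *_ = np.linalg.lstsq(A, nxt, rcond=None)

pred = a * prev + b
mae = np.mean(np.abs(pred - nxt))    # average absolute prediction error
r2 = 1.0 - np.sum((nxt - pred) ** 2) / np.sum((nxt - nxt.mean()) ** 2)
print(round(a, 2), round(b, 2), round(mae, 2), round(r2, 2))
```

The same least-squares machinery scales to any number of predictors, which is what the two-season model below does.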

The model is statistically significant (R² = 0.42), meaning that, unsurprisingly, the offensive production of a forward is not totally random, and there is a greater chance that Sidney Crosby will deliver a quality offensive season than Matt Martin. However, statistically significant does not mean accurate; in this case, the average difference between the model predictions and the actual ESPTS/GP is 0.11, a fairly significant error that is important to keep in mind when applying the model.

If we take Eric Tulsky's advice and predict the ESPTS/GP from the ESPTS/GP that forwards have obtained over the two previous seasons, we obtain the following model: ESPTS/GP_next_season = 0.48 × ESPTS/GP_previous_season + 0.30 × ESPTS/GP_two_seasons_ago + 0.07. The accuracy of this model is indeed a little higher (R² = 0.48), which indicates that going back two seasons is relevant to quantify offensive quality. This makes a lot of sense, given that players sometimes
have seasons with an offensive production exceeding or below their real quality
level, for instance because of injuries or driven by unusually strong or weak
linemates. 
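The two-season model is the same least-squares fit with one extra column. The sketch below again uses synthetic stand-in data with planted coefficients (not real forward data), so it only illustrates the mechanics of recovering values in the neighborhood of the 0.48 and 0.30 reported above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data with planted coefficients (not real forward data)
s1 = rng.uniform(0.2, 1.0, 500)                      # ESPTS/GP, previous season
s2 = 0.7 * s1 + 0.15 + rng.normal(0.0, 0.12, 500)    # two seasons ago, correlated with s1
nxt = 0.48 * s1 + 0.30 * s2 + 0.07 + rng.normal(0.0, 0.10, 500)

# Two-predictor least squares: ESPTS/GP_next = a1*s1 + a2*s2 + b
A = np.column_stack([s1, s2, np.ones_like(s1)])
(a1, a2, b), *_ = np.linalg.lstsq(A, nxt, rcond=None)
print(round(a1, 2), round(a2, 2), round(b, 2))
```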

We will now attempt to
develop a more complex model taking into consideration more statistics
describing the past performance of forwards, and our objective will be to
improve the prediction accuracy in comparison with these simple linear models.

Let's get inside our heads

[NN_Fig2: schematic of an artificial neural network, from input variables through layers of neurons to the output]

When scientists (myself included) are not smart enough to develop new modeling algorithms, they do the next best thing: they copy someone else. Or in this case, something else: our brain.

If you remember your biology
class, then you may recall that the brain is composed of billions of neurons
interacting with one another. Each neuron has multiple dendrites which receive input signals from multiple other neurons. An input signal can either excite or
inhibit the neuron, and the balance of excitation and inhibition will trigger
the transmission of a specific output signal to the following layer of neurons
and so on, until the balance of activated and inhibited neurons is interpreted by
the last layer of neurons to maintain cognitive functions and perform an
action.

An artificial neural network, as
illustrated above, attempts to process information in a somewhat similar
manner. The inputs of an artificial neural network are the variables that we
want to use for the prediction. In our case, the input variables will be all
the statistics that we can find describing the 5v5 performance of a forward during its two prior
seasons. These input variables are combined by the neurons,
but not in an equal manner: some inputs are given more weight than others. The
output value of a layer of neurons becomes the input value of the following
layer and so on, until our last layer of neurons whose output is the variable
we want to predict. In our case, the final output of our artificial neural
network will be the prediction of the ESPTS/GP that each forward will obtain
during the following season.
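To make the layer-by-layer description concrete, here is a minimal forward pass in NumPy. The network shape (292 inputs, one hidden layer, one output) mirrors the article, but the weights are random placeholders purely for illustration; in the actual model they are estimated from historical ESPTS/GP values, so the printed prediction here is meaningless:

```python
import numpy as np

def relu(x):
    # Activation: negative weighted sums are "inhibited" (clipped to zero)
    return np.maximum(0.0, x)

# Toy fully connected network. The weights below are random placeholders;
# in the article they are estimated from historical ESPTS/GP values.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 292)        # the 292 input statistics for one forward

W1 = rng.normal(0.0, 0.1, (16, 292))  # inputs -> 16 hidden neurons
b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.1, (1, 16))    # hidden neurons -> single output neuron
b2 = np.zeros(1)

# Each layer takes a weighted sum of its inputs, applies the activation,
# and its output becomes the input of the next layer
h = relu(W1 @ x + b1)
espts_gp_pred = (W2 @ h + b2).item()  # predicted ESPTS/GP (meaningless with random weights)
print(espts_gp_pred)
```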

In summary, you can see an artificial neural network as a set of very simple models (the individual neurons) which, when combined, generate a unique and very powerful model able to predict complex things such as the weather, or the presence of a pedestrian that a self-driving car should not run into.

In the artificial neural
network, the weights that determine the importance given to each input variable
on the prediction of the output are like the regression coefficients that we
estimate when we fit a line or a curve: they are adjustable parameters that we
will try to estimate from the historic values of ESPTS/GP. 

And the 2016-2017 best 5v5 offensive player is…

So, let’s get to work and
estimate the value of the weight parameters in our artificial neural network
that we should use so that the model provides the most accurate prediction
possible of the ESPTS/GP. To do so, I have extracted all the 5v5 data of
forwards from corsica.hockey since 2007-2008, considering everyone with at
least 200 minutes of 5v5 TOI per season. I have also added the age, height and
weight of each forward using Rob Vollman's statistics spreadsheet, because it
probably makes sense that the season-to-season evolution of scoring rate is in
part a function of the age and physical attributes of players. In the end, we are
developing a model which uses 292 input variables (146 describing the 5v5
performance of forwards the previous season, and 146 describing the 5v5
performance two seasons before). 

We have 1730 ESPTS/GP values obtained by forwards from 2007-2008 to 2015-2016 that we can use to estimate the weight parameters in our artificial neural network. For those who are into that sort of thing and may want to repeat this process (everyone else can jump straight to the next paragraph), here are a few more details on how I did it. I first preprocessed the input variables using a probabilistic principal component analysis with a threshold of 0.025, which I found to work best in this case. I then estimated the weights from the historical values of ESPTS/GP by backpropagation, using the Levenberg-Marquardt algorithm. In the algorithm, I kept 70% of the data for calibration, 15% for validation, and 15% as the test set.
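As a rough sketch of this workflow: scikit-learn offers neither probabilistic PCA thresholding nor Levenberg-Marquardt backpropagation, so the pipeline below substitutes ordinary PCA and an Adam-trained MLP, and runs on synthetic stand-in data rather than the corsica.hockey extract. It mirrors the shape of the procedure (292 inputs, 1730 forward-seasons, 70/15/15 split), not its exact numbers:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 1730 forward-seasons x 292 statistics
# (the real data, from corsica.hockey and Rob Vollman's spreadsheet,
# is not bundled here)
rng = np.random.default_rng(3)
X = rng.normal(size=(1730, 292))
y = 0.4 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0.0, 0.5, 1730)

# 70% calibration, 15% validation, 15% test, as in the article
X_cal, X_tmp, y_cal, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# The article preprocesses with probabilistic PCA and trains by
# Levenberg-Marquardt backpropagation; ordinary PCA + an Adam-trained
# MLP is the closest off-the-shelf scikit-learn equivalent.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),    # keep components explaining 95% of the variance
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(X_cal, y_cal)

# The validation set is where competing architectures would be compared;
# the test set gives the final, untouched accuracy estimate
print("validation R^2:", round(model.score(X_val, y_val), 2))
print("test R^2:", round(model.score(X_test, y_test), 2))
```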

Now that the
weights have been estimated and we have a functional artificial neural network,
here is a plot of the ESPTS/GP predicted by the artificial neural network
compared to the actual ESPTS/GP that forwards have obtained between 2007-2008 and
2015-2016:

[NN_Fig3: ESPTS/GP predicted by the artificial neural network vs. actual ESPTS/GP, 2007-2008 to 2015-2016]

As you can see, the
predicted vs actual values for the
artificial neural network are more densely packed around a line compared to the
simple linear model that we first developed, and the accuracy of the artificial
neural network is indeed higher (R² = 0.58).

The model remains imperfect,
given that some underlying factors affecting the ESPTS/GP of forwards (shooting
percentage, quality of linemates and opponents and time on ice per game,
notably) can vary significantly from season to season following patterns that
are hard to forecast. Nevertheless, the average difference between the model
predictions and the actual ESPTS/GP is 0.09, meaning that the artificial
neural network is 20-25% more accurate than our initial simple model. The
higher accuracy indicates that the additional input variables considered in the
artificial neural network compared to the simple linear model indeed contain additional relevant information to predict future
scoring rates. An argument can also be made that this may be near the most
accurate prediction of future scoring rates we can possibly achieve, given the
number of statistics used for the prediction and the power of the modeling
approach.

And now, here is a graph of the 2015-2016 top 20 scorers in terms of ESPTS/GP, followed by the artificial neural network's prediction of the top 20 scorers for 2016-2017:

[NN_Fig4: 2015-2016 top 20 forwards by ESPTS/GP]

[NN_Fig5: predicted 2016-2017 top 20 forwards by ESPTS/GP]

It is important to note here that, since the artificial neural network uses the statistics obtained by forwards over their two previous seasons, it does not predict the 2016-2017 ESPTS/GP of players who were rookies in 2015-2016. In other words, this is a McDavid-less top 20.

The artificial neural
network predicts a great 2016-2017 season from John Tavares. It is certainly a
sensible choice, given that Tavares has been a premium point producer for
several seasons and, at 26 years of age, should be in his prime scoring years.

The artificial neural
network also predicts that the NHL 5v5 scoring race will be led by fairly young
guys. At 37 years of age, Joe Thornton is by far the oldest of the predicted
top 20, and all but two of them are below 30. The model also predicts that
ageless wonder Jaromir Jagr may at last face a notable decrease in 5v5
offensive production.

In terms of specific teams,
the Oilers should be happy to see Jordan Eberle on that list, right behind
Taylor Hall. If everything clicks, Eberle, McDavid (0.64 ESPTS/GP last season)
and Lucic (top 20 in ESPTS/GP last season) could be a pretty formidable unit.
The Stars can also feel good about their top line, anchored by Tyler Seguin and
Jamie Benn, as well as the Jets, who have two significantly underrated players
in Blake Wheeler and Mark Scheifele.

At the other end of the
spectrum, the Blackhawks, the Devils and the Panthers may have to deal with
important drops in the scoring rate of Patrick Kane, Mike Cammalleri and, as
mentioned above, Jaromir Jagr. I have not dug deeply into the specifics for
each player, but reasons for the predicted scoring decreases can include a
regression towards the mean in shooting attempts or on-ice shooting percentage,
as well as age taking its toll.

All in all, any model trying
to catch all the subtleties of such a complex game as hockey is bound to have
flaws, but it is certainly fun to look at the predictions of the artificial
neural network and it provides a small additional tool to build a great real or
fantasy team. It will also be interesting to circle back at the end of the
season to verify how the predictions have held up. 

Glossary

Artificial neural network: A modeling approach used to predict an output variable (in our case future scoring rates) from the knowledge of some input variables (in our case the statistics describing the past 5v5 performance of forwards), in a way similar to how the brain processes information. It is generally used to model complex phenomena with many input variables.

Neuron: A subunit of the artificial neural network, which calculates an output value by combining its input values in such a way as to give more importance to some inputs than others.

Weight: A parameter of the artificial neural network which determines, for each neuron, the importance attributed to each input variable. Like the regression coefficients of a linear model, the weights are adjustable parameters estimated from historical values so that the model provides the most accurate predictions possible.