Predicting future scoring rates is a hot topic in the analytics community, and with good reason: the ability to score goals explains half of the success or failure of a team (the other half being the ability to prevent them).

Just a few weeks ago, Travis Yost investigated the statistics that best predict future scoring rates, and showed that the scoring rate and relative Corsi-for during the previous season are equally good predictors. Another study worth mentioning comes from Eric Tulsky, who showed that not only the scoring rate of the previous season, but also that of the seasons before it, should be considered when predicting future scoring rates. Going back more than one season helps distinguish skill from luck and estimate the true offensive quality of a player.

From what I have seen, most models developed to predict scoring rates are relatively simple and only use one or two statistics for the prediction, the most important one being the scoring rate of that same player during the previous season. Yet, as we have previously discussed, hundreds of new statistics have been made available over the last decade to describe the 5v5 performance of NHL forwards. Each of these new statistics can contain a tiny bit of information regarding the offensive quality of a player. If a modeling approach is able to find this information, we will be able to predict future scoring rates with greater accuracy.

The increasing number of statistics also provides the opportunity to use modeling approaches that go beyond fitting a line or a curve, and that is exactly what we will do here using a modeling approach called an artificial neural network. But first, let’s start with a simple approach.

## The simple model

We will try to predict the number of 5v5 points per game played (ESPTS/GP) that NHL forwards will obtain next season. First, let’s do so using a simple linear model. If we take the ESPTS/GP obtained by forwards from 2007-2008 to 2015-2016 (with a minimum of 200 min of 5v5 TOI) and regress their ESPTS/GP against the ESPTS/GP they obtained the previous season, we obtain the following model: ESPTS/GP_{next_season} = 0.67*ESPTS/GP_{previous_season} + 0.11. To give us an idea of the accuracy of this model, here is a plot of the ESPTS/GP predicted by the model compared to the actual ESPTS/GP that forwards obtained between 2007-2008 and 2015-2016:
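The regression itself is easy to reproduce. Here is a minimal sketch using scikit-learn; the arrays are placeholders standing in for the per-forward data described above, so the fitted coefficients will differ from the 0.67 and 0.11 reported here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: one row per forward-season.
# x = ESPTS/GP in the previous season, y = ESPTS/GP the next season.
x = np.array([[0.30], [0.45], [0.55], [0.70], [0.25], [0.60]])
y = np.array([0.32, 0.41, 0.50, 0.63, 0.29, 0.52])

# Fit y = slope*x + intercept; on the real data this yields
# roughly y = 0.67*x + 0.11.
model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)
```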

The model is statistically significant (*R*^{2} = 0.42), meaning that, unsurprisingly, the offensive production of a forward is not totally random, and there is a greater chance that Sidney Crosby will deliver a quality offensive season than Matt Martin. However, statistically significant does not mean accurate; in this case, the average difference between the model predictions and the actual ESPTS/GP is 0.11, a fairly significant error that is important to keep in mind when applying the model.

If we take Eric Tulsky’s advice and predict the ESPTS/GP from the ESPTS/GP that forwards obtained over the two previous seasons, we obtain the following model: ESPTS/GP_{next_season} = 0.48*ESPTS/GP_{previous_season} + 0.30*ESPTS/GP_{two_seasons_ago} + 0.07. The accuracy of this model is indeed a little higher (*R*^{2} = 0.48), and indicates that going back two seasons is relevant to quantify offensive qualities. This makes a lot of sense, given that players sometimes have seasons with an offensive production exceeding or below their real quality level, for instance because of injuries, or driven by unusually strong or weak linemates.
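Plugging in the coefficients reported above, the two-season model can be written as a small helper function (the coefficients come from the fit described in the text; refitting on different data would change them):

```python
def predict_esptsgp_2s(prev: float, two_ago: float) -> float:
    """Two-season linear model from the text:
    ESPTS/GP_next = 0.48*prev + 0.30*two_ago + 0.07."""
    return 0.48 * prev + 0.30 * two_ago + 0.07

# A forward with 0.50 ESPTS/GP in each of his last two seasons
# projects to 0.48*0.50 + 0.30*0.50 + 0.07 = 0.46 ESPTS/GP.
print(round(predict_esptsgp_2s(0.50, 0.50), 2))
```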

We will now attempt to develop a more complex model taking into consideration more statistics describing the past performance of forwards, and our objective will be to improve the prediction accuracy in comparison with these simple linear models.

## Let’s get in our head

When scientists (myself included) are not smart enough to develop new modeling algorithms, they do the next best thing: they copy someone else. Or in this case, something else: our brain.

If you remember your biology class, then you may recall that the brain is composed of billions of neurons interacting with one another. Each neuron has multiple dendrites, which receive input signals from multiple other neurons. An input signal can either excite or inhibit the neuron, and the balance of excitation and inhibition will trigger the transmission of a specific output signal to the following layer of neurons, and so on, until the balance of activated and inhibited neurons is interpreted by the last layer of neurons to maintain cognitive functions and perform an action.

An artificial neural network, as illustrated above, attempts to process information in a somewhat similar manner. The inputs of an artificial neural network are the variables that we want to use for the prediction. In our case, the input variables will be all the statistics that we can find describing the 5v5 performance of a forward during his two prior seasons. These input variables are combined by the neurons, but not in an equal manner: some inputs are given more weight than others. The output value of a layer of neurons becomes the input value of the following layer, and so on, until our last layer of neurons, whose output is the variable we want to predict. In our case, the final output of our artificial neural network will be the prediction of the ESPTS/GP that each forward will obtain during the following season.
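To make the layer-by-layer flow concrete, here is a minimal forward pass through a tiny made-up network. The weights below are invented for illustration only; in a real network they are estimated from data:

```python
import numpy as np

def forward(x, weights, biases):
    """Pass input vector x through each layer: weighted sum plus bias,
    then a tanh activation (the 'excite or inhibit' step)."""
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

# Tiny illustrative network: 3 inputs -> 2 hidden neurons -> 1 output.
weights = [np.array([[0.5, -0.2, 0.1],
                     [0.3,  0.4, -0.1]]),
           np.array([[0.8, -0.5]])]
biases = [np.array([0.0, 0.1]), np.array([0.05])]

out = forward(np.array([0.5, 0.3, 0.9]), weights, biases)
print(out)
```

Each row of a weight matrix is one neuron: it combines all its inputs, giving more importance to some than others, exactly as described above.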

In summary, you can see an artificial neural network as a set of very simple models (each neuron) which, when combined together, generate a unique and very powerful one, able to predict complex things such as the weather or the presence of a pedestrian that a self-driving car should not run into.

In the artificial neural network, the weights that determine the importance given to each input variable in the prediction of the output are like the regression coefficients that we estimate when we fit a line or a curve: they are adjustable parameters that we will try to estimate from the historical values of ESPTS/GP.

## And the 2016-2017 best 5v5 offensive player is…

So, let’s get to work and estimate the value of the weight parameters in our artificial neural network so that the model provides the most accurate prediction possible of the ESPTS/GP. To do so, I have extracted all the 5v5 data of forwards from corsica.hockey since 2007-2008, considering everyone with at least 200 minutes of 5v5 TOI per season. I have also added the age, height and weight of each forward using Rob Vollman’s statistics spreadsheet, because it probably makes sense that the season-to-season evolution of scoring rate is in part a function of the age and physical attributes of players. In the end, we are developing a model which uses 292 input variables (146 describing the 5v5 performance of forwards the previous season, and 146 describing the 5v5 performance two seasons before).

We have 1730 ESPTS/GP values obtained by forwards from 2007-2008 to 2015-2016 that we can use to estimate the value of the weight parameters in our artificial neural network. For those who are into that sort of thing and may want to repeat this process (everyone else can jump straight to the next paragraph), here are a few more details on how I did that. I first preprocessed the input variables using a probabilistic principal component analysis with a threshold of 0.025, which I found to work best in this case. I then estimated the weights from the historical values of ESPTS/GP by backpropagation using the Levenberg-Marquardt algorithm, keeping 70% of the data for calibration, 15% for validation, and 15% as the test set.
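A broadly comparable pipeline can be sketched with scikit-learn, with two substitutions worth flagging: ordinary `PCA` stands in for the probabilistic PCA, and `MLPRegressor` trains with gradient-based optimizers rather than Levenberg-Marquardt. It is an approximation of the setup, not a reproduction, and the data below are random placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Placeholder data standing in for the forward-seasons
# (the real dataset has 1730 rows and 292 input statistics).
X = rng.normal(size=(600, 50))
y = 0.4 + 0.1 * X[:, 0] + 0.05 * rng.normal(size=600)

# 70% calibration, 15% validation, 15% test.
X_cal, X_rest, y_cal, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

model = make_pipeline(
    PCA(n_components=0.95),  # keep components explaining 95% of variance
    MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
model.fit(X_cal, y_cal)
print(model.score(X_val, y_val))
```

The validation set plays the same role as in the text: checking accuracy on data the network did not train on, to guard against overfitting.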

Now that the weights have been estimated and we have a functional artificial neural network, here is a plot of the ESPTS/GP predicted by the artificial neural network compared to the actual ESPTS/GP that forwards obtained between 2007-2008 and 2015-2016:

As you can see, the predicted *vs* actual values for the artificial neural network are more densely packed around a line compared to the simple linear model that we first developed, and the accuracy of the artificial neural network is indeed higher (*R*^{2} = 0.58).

The model remains imperfect, given that some underlying factors affecting the ESPTS/GP of forwards (shooting percentage, quality of linemates and opponents, and time on ice per game, notably) can vary significantly from season to season following patterns that are hard to forecast. Nevertheless, the average difference between the model predictions and the actual ESPTS/GP is 0.09, meaning that the artificial neural network is 20-25% more accurate than our initial simple model. The higher accuracy indicates that the additional input variables considered in the artificial neural network indeed contain additional relevant information to predict future scoring rates. An argument can also be made that this may be near the most accurate prediction of future scoring rates we can possibly achieve, given the number of statistics used for the prediction and the power of the modeling approach.
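The two accuracy measures used throughout this piece are straightforward to compute. Here is how the "average difference" (mean absolute error) and *R*^{2} would be obtained from predicted and actual values; the arrays below are placeholders, not real forward data:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: the 'average difference' quoted in the text."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))

def r_squared(actual, predicted):
    """Coefficient of determination (R^2)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    residual = np.sum((actual - predicted) ** 2)
    total = np.sum((actual - actual.mean()) ** 2)
    return 1 - residual / total

actual = np.array([0.50, 0.28, 0.68, 0.42])      # placeholder ESPTS/GP
predicted = np.array([0.45, 0.33, 0.60, 0.46])
print(mae(actual, predicted), r_squared(actual, predicted))
```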

And now, here is a graph of the 2015-2016 top 20 scorers in terms of ESPTS/GP, followed by the artificial neural network’s prediction of the top 20 scorers for 2016-2017:

It is important to note here that, given the artificial neural network uses the statistics obtained by forwards over their two previous seasons, it does not predict the 2016-2017 ESPTS/GP of those who were rookies in 2015-2016. In other words, these are McDavid-less top 20 lists.

The artificial neural network predicts a great 2016-2017 season from John Tavares. It is certainly a sensible choice, given that Tavares has been a premium point producer for several seasons and, at 26 years of age, should be in his prime scoring years.

The artificial neural network also predicts that the NHL 5v5 scoring race will be led by fairly young guys. At 37 years of age, Joe Thornton is by far the oldest of the predicted top 20, and all but two of them are below 30. The model also predicts that ageless wonder Jaromir Jagr may at last face a notable decrease in 5v5 offensive production.

In terms of specific teams, the Oilers should be happy to see Jordan Eberle on that list, right behind Taylor Hall. If everything clicks, Eberle, McDavid (0.64 ESPTS/GP last season) and Lucic (top 20 in ESPTS/GP last season) could be a pretty formidable unit. The Stars can also feel good about their top line, anchored by Tyler Seguin and Jamie Benn, as can the Jets, who have two significantly underrated players in Blake Wheeler and Mark Scheifele.

At the other end of the spectrum, the Blackhawks, the Devils and the Panthers may have to deal with important drops in the scoring rates of Patrick Kane, Mike Cammalleri and, as mentioned above, Jaromir Jagr. I have not dug deeply into the specifics for each player, but reasons for the predicted scoring decreases can include regression towards the mean in shooting attempts or on-ice shooting percentage, as well as age taking its toll.

All in all, any model trying to catch all the subtleties of such a complex game as hockey is bound to have flaws, but it is certainly fun to look at the predictions of the artificial neural network, and it provides a small additional tool for building a great real or fantasy team. It will also be interesting to circle back at the end of the season to verify how the predictions have held up.

## Glossary

| Term | Definition |
| --- | --- |
| Artificial neural network | A modeling approach used to predict an output variable (in our case future scoring rates) from the knowledge of some input variables (in our case the statistics describing the past 5v5 performance of forwards) in a similar way that the brain processes information. This modeling approach is generally used to model complex phenomena with many input variables. |
| Neuron | A subunit of the artificial neural network, which calculates an output value by combining its input values in such a way as to give more importance to some input values than others. |
| Weight | The parameters in the artificial neural network which determine, for each neuron, the importance attributed to each input variable. Like the regression coefficients in a linear model, these are adjustable parameters that we estimate using historical values such that the model provides the most accurate predictions. |