Model Testing – Measuring Forecast Accuracy

When developing a betting model it is important to measure its performance properly. A formal performance measure provides a benchmark against which to test alternative models.

Obviously, when a model is used for betting, a key measure of performance is the profit or loss it generates, but it helps to have more than one metric to distinguish between similarly performing models.

This article is the second of a two-part series. The first article, Model Testing – Measuring Profit & Loss, looks at various ways to measure betting profit. This article outlines various measures of forecast accuracy in the context of betting models. The examples focus on measuring the accuracy of a model that predicts total game scores.

This article borrows heavily from Chapter 2 of Forecasting: Methods and Applications, by Makridakis, Wheelwright and Hyndman.

Understanding Sigma (Σ) Notation

Many of the formulas below use sigma notation. For those who can’t recall how to use Σ from their school days, click here to brush up on sigma notation.
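As a quick refresher, the Σ symbol is simply shorthand for adding up a sequence of terms, for example:

```latex
\sum_{i=1}^{n} e_i = e_1 + e_2 + \dots + e_n
```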

Standard Statistical Measures

Below is a range of measures to evaluate the accuracy of a forecasting model. In the context of sports betting, applications include forecasting total scores in rugby, the number of corners in soccer and winning margins in basketball.

To help illustrate the computations involved, the following measures will be applied to a simplified data set. The table below shows rugby league total scores and a model’s pre-game forecast of those totals.

Game (i)    Total Score (Yi)    Forecast (Fi)    Forecast Error (ei)
1           53                  50                 3
2           36                  52               -16
3           34                  32                 2
4           40                  36                 4
5           39                  44                -5
6           51                  54                -3
7           42                  42                 0
8           36                  44                -8

If Yi is the actual total score for game i and Fi is the pre-game forecast of the total score for game i, then we define the error of the forecast for game i as:

ei = Yi – Fi

Looking at the table above, the error for game 1 is 53 – 50 = 3, the error for game 2 is 36 – 52 = -16, and so on.

For the equations below we will use the variable n to denote how many completed games we have forecasts for.

The above data set has total scores and their associated forecasts for 8 games, so n equals 8 in this case.
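For readers who prefer to follow along in code, here is a minimal Python sketch (not part of the original article) that reproduces the table above; the variable names actuals, forecasts and errors are just illustrative choices.

```python
# Total scores (Y_i) and pre-game forecasts (F_i) for the 8 example games
actuals   = [53, 36, 34, 40, 39, 51, 42, 36]
forecasts = [50, 52, 32, 36, 44, 54, 42, 44]

# Forecast error for each game: e_i = Y_i - F_i
errors = [y - f for y, f in zip(actuals, forecasts)]
n = len(errors)

print(errors)  # [3, -16, 2, 4, -5, -3, 0, -8]
print(n)       # 8
```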

Using our variables ei and n, we can calculate the following statistical measures of forecast accuracy:

Mean error (ME):

ME = (1/n) Σ ei (summing over all n games)

This measure will often be small because positive and negative errors offset each other. That makes it fairly useless as a measure of accuracy: forecast errors of -2, +2, -2 and +2 sum to zero, but so do -20, +20, -20 and +20. The mean error is still worth calculating, however, because it reveals any systematic under- or over-estimation, which is called forecast bias. A model that over- and under-estimates fairly equally will have a mean error close to zero, while a model with a bias towards underestimating scores will have a strongly positive mean error (note that ei = Yi – Fi, so if you underestimate the value, the forecast error is positive).

In the example data above the ME equals (3 + -16 + 2 + …)/8 = -2.875, illustrating that on average this model overestimated the total score.
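As a quick check, the mean error can be computed in one line of Python (the errors list is repeated here so the snippet runs on its own):

```python
errors = [3, -16, 2, 4, -5, -3, 0, -8]

me = sum(errors) / len(errors)
print(me)  # -2.875: on average the model overestimates the total score
```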

Mean absolute error (MAE):

MAE = (1/n) Σ |ei|

Mean absolute error gets around the offsetting effect of positive and negative forecast errors by taking the absolute value of each error. Using our data set above we get |e1| = |53 – 50| = 3, |e2| = |36 – 52| = 16, and so on.

The advantage of using MAE is that it is measured on a scale people can understand. For example, if a model had forecast errors of -2, +2, -2 and +2, the MAE would equal 2, showing the model’s forecasts are 2 units off the correct value, on average.

In the example data above the MAE equals (3 + 16 + 2 + …)/8 = 5.125. In words, the estimated total score was 5.125 points off the actual total, on average.
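The same calculation for MAE, sketched in Python:

```python
errors = [3, -16, 2, 4, -5, -3, 0, -8]

mae = sum(abs(e) for e in errors) / len(errors)
print(mae)  # 5.125: forecasts are about 5 points off the actual total, on average
```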

Mean squared error (MSE):

MSE = (1/n) Σ ei²

An alternative to taking the absolute value of each error is to square it, so a forecast error of -2 becomes (-2)² = 4. Like MAE, this avoids having positive and negative errors offset each other.

A point of difference between MAE and MSE is that MSE punishes large errors more heavily, because squaring larger numbers produces markedly bigger results. For example, the difference between 4 and 5 is just 1, but the difference between 4² and 5² is 9.

From a mathematical perspective, many practitioners prefer MSE over MAE because squared terms are easier to deal with in optimisation calculations: it is easier to take the derivative of a function with squared terms than of one with absolute value terms.

In the example data above the MSE equals (3² + (-16)² + 2² + …)/8 = 47.875. Note that this is significantly larger than the MAE due to the large error for Game 2.
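And the MSE calculation in Python:

```python
errors = [3, -16, 2, 4, -5, -3, 0, -8]

mse = sum(e ** 2 for e in errors) / len(errors)
print(mse)  # 47.875, dominated by the -16 error in game 2
```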

Percentage / relative errors

The above measures are all dependent on the scale of the data. For example, these measures would likely be much larger for basketball total scores than rugby total scores because basketball scores generally are much higher. If you’re off by 10% on a basketball score this could imply being 18 points off, whereas with rugby 10% could mean being off by 4 points.

The previously discussed error measures make comparing models between sports very difficult. The following measures adjust for the scale of the data, which can facilitate comparisons between models applied to different sports.

Rather than using raw errors, ei = Yi – Fi, percentage errors are used instead, which are calculated as follows:

PEi = (Yi – Fi) / Yi, usually expressed as a percentage

Using our original data set, for game 1 the percentage error is PE1 = (53 – 50)/53 = 0.0566 = 5.66%.
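The percentage error for every game in the example data set can be computed the same way; a Python sketch:

```python
actuals   = [53, 36, 34, 40, 39, 51, 42, 36]
forecasts = [50, 52, 32, 36, 44, 54, 42, 44]

# Percentage error for each game: 100 * (Y_i - F_i) / Y_i
pct_errors = [100 * (y - f) / y for y, f in zip(actuals, forecasts)]
print([round(pe, 2) for pe in pct_errors])
# [5.66, -44.44, 5.88, 10.0, -12.82, -5.88, 0.0, -22.22]
```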

Mean percentage error (MPE):

MPE = (1/n) Σ PEi

This is equivalent to ME discussed earlier, but it’s calculated using percentage errors.

MPE suffers from the same drawback as ME in that positive and negative PEs offset each other; however, this also means it provides a measure of systematic bias.

In the example data above the MPE equals ((53 – 50)/53 + (36 – 52)/36 + …)/8 = -8.0%.
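A Python sketch of the MPE calculation:

```python
actuals   = [53, 36, 34, 40, 39, 51, 42, 36]
forecasts = [50, 52, 32, 36, 44, 54, 42, 44]

mpe = sum((y - f) / y for y, f in zip(actuals, forecasts)) / len(actuals)
print(round(100 * mpe, 1))  # -8.0 (%): a bias towards overestimating the total
```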

Mean absolute percentage error (MAPE):

MAPE = (1/n) Σ |PEi|

MAPE is equivalent to MAE discussed earlier, but it’s calculated using percentage errors.

MAPE works well for total score models, but it isn’t ideal for all situations. Football scores, for example, tend to be low, so measures like MAE and MSE may make more intuitive sense than MAPE, where a game that finishes with a total score of 1 when the model predicted 5 produces a percentage error of -400%.

A more serious limitation of MAPE occurs when your data set can contain zero values. In the context of sports betting, if you’re forecasting winning margins and a draw is possible then you can’t use MAPE, because if the final scores are level then Yi = 0 and PEi can’t be calculated due to division by zero. For this reason MAPE works best for modelling results such as total basketball, AFL and rugby scores, rather than winning margins or football total scores, which can be zero.

In the example data above the MAPE equals (|53 – 50|/53 + |36 – 52|/36 + …)/8 = 13.4%.
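And the corresponding MAPE calculation in Python:

```python
actuals   = [53, 36, 34, 40, 39, 51, 42, 36]
forecasts = [50, 52, 32, 36, 44, 54, 42, 44]

mape = sum(abs(y - f) / y for y, f in zip(actuals, forecasts)) / len(actuals)
print(round(100 * mape, 1))  # 13.4 (%)
```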

Comparing forecast methods

Once you have a measure of forecasting accuracy, how do you know if it’s a good result? Is a total score MSE of 9 or a MAPE of 6% good? What we need is to compare the model’s performance against more naive (basic) models to test whether it represents a meaningful improvement.

In the context of forecasting total scores, suppose we have a naive model that predicts the total score for each game by simply using the total score from the last time the two sides met at the same venue. If, in rugby league, Team A vs. Team B produced a combined score of 38 the last time they met, then the naive model will predict a combined score of 38 for their next meeting.

Once a naive model has been created you can then calculate the forecast accuracy for it and compare its statistics to the accuracy calculations of the more sophisticated model. If the sophisticated model can’t outperform the naive model then it may mean going back to the drawing board.
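A minimal Python sketch of such a comparison, assuming each game record carries the actual total, the model’s forecast and the total from the previous meeting at the same venue; the previous-meeting totals below are made-up placeholders purely for illustration:

```python
# (actual total, model forecast, total from the previous meeting at this venue)
# The third value in each tuple is a made-up placeholder for illustration only.
games = [
    (53, 50, 48),
    (36, 52, 41),
    (34, 32, 30),
    (40, 36, 45),
]

def mae(pairs):
    """Mean absolute error over (actual, forecast) pairs."""
    return sum(abs(y - f) for y, f in pairs) / len(pairs)

model_mae = mae([(y, f) for y, f, _ in games])
naive_mae = mae([(y, prev) for y, _, prev in games])

print(f"Model MAE: {model_mae:.2f}, naive MAE: {naive_mae:.2f}")
# If the model's MAE isn't clearly lower than the naive MAE,
# it may be time to go back to the drawing board.
```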

Out-of-sample accuracy measurement

The above measures quantify the accuracy of a betting model; however, achieving good accuracy on historical data doesn’t guarantee the model will work well in the future.

Suppose you obtain a historical odds data set and you identify a trading strategy that would have worked well over the past three seasons. How confident can you be that the strategy will continue to work in the future? As anyone who has analysed historical data can tell you, if you look hard enough you will find some strategy that would have made a killing had it been employed in previous years; however, this provides no guarantee of future success.

A way to know whether a model is genuinely useful, and not simply reflecting quirks in your specific data set, is to split your data into two parts before constructing the model. The first part is used to build and calibrate the model, and the second part, the holdout set, is used to test how well the calibrated model performs on data it has never seen.

We highly recommend you read our article, Post-Sample Evaluation – the Importance of Creating a Holdout Set before Calibrating a Betting Strategy. It outlines how to create a holdout set to test a calibrated betting model. This practice provides an out-of-sample accuracy measurement because it involves evaluating a forecasting model using more recent data than was used to calibrate the model.
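A simple sketch of that split in Python; the 70/30 proportion and the placeholder list of games are arbitrary choices for illustration:

```python
# Placeholder: in practice this would be your chronologically ordered game history
games = [f"game_{i}" for i in range(100)]

split = int(len(games) * 0.7)
calibration_set = games[:split]   # used to build and calibrate the model
holdout_set     = games[split:]   # kept aside to evaluate the calibrated model

print(len(calibration_set), len(holdout_set))  # 70 30
```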

Sources

An excellent book on forecasting is:

Forecasting: Methods and Applications
Spyros G. Makridakis, Steven C. Wheelwright, Rob J. Hyndman
December 1997, © 1998
ISBN: 978-0-471-53233-0

This article is based heavily on Chapter 2 of this book.
