Post-Sample Evaluation – the Importance of Creating a Holdout Set before Calibrating a Betting Strategy


Most betting models use historical data to predict the future. In fact, at its core, sports betting is the act of forecasting sporting results. Bookmakers rely heavily on historical results to set the odds while enthusiast punters interpret the historical data themselves to identify positive expected value plays.

A major challenge when using historical results, however, is knowing how reliable they are to predict the future. Player trades, injuries and coaching changes create routine breaks in the data, while a team’s form will naturally undulate throughout a season.

Suppose you obtain a historical odds data set and you identify a trading strategy that would have worked over the past three seasons. How confident can you be that the strategy will work in the future? As anyone who has analysed historical data can tell you, if you look hard enough, you will find some strategy that would have made a killing had it been employed in previous years, however this provides no guarantee for future success.

This article will outline how to create a holdout set to test a calibrated betting model. This practice is a form of post-sample evaluation because it involves evaluating a forecasting model using more recent data than was used to calibrate the model.

Creating a holdout set involves splitting a data set into two parts. The first is used for model calibration and the second “holdout set” is used to test the model that was constructed using the first set. While this practice doesn’t guaranteed success, it should help to weed out any short term anomalies that may otherwise lead you to accidentally believe they reflect a persistent trend.

Why You Can’t Always Trust Historical Data – an Illustrative Example

The following example uses actual historical T20 Big Bash results, which can be found in the historical data section.

A common betting market in cricket is picking who will win the coin toss. Presuming the coin is fair it’s a terrible market to bet on, but bookmakers like it because they know the odds of each team winning are exactly 50% and they can eke out a safe long-term profit by offering odds below 2.00 on each team.

Using all available data from the 2011/12 season through to 2013/14, you will find that the away team won the coin toss 52.48% of the time. This means for coin toss odds over 1.906 you would have previously made a profit by backing the away team to win the toss in every game. In this instance, common sense allows use to understand that the 52.5% coin toss win rate is just a result of random events because we know the mechanics behind a coin toss. In most instances, however, you won’t be able to easily apply common sense to rule out historical trends that simply occurred by chance.

Switching sports for a second, suppose you see that a team covers the line in an NBA game 52.48% of the time when they play on exactly three days rest. Is this simply a quirk in the data? Or is there something valuable to be obtained from this result? Using a holdout set will help you to answer this.

Looking back at the cricket, suppose we split the T20 Big Bash data set chronologically into two parts: the first and earliest data set will be used to build and calibrate our model. In our example this will consist of the 2011/12 and 2012/13 seasons. The 2013/14 season will then be our holdout set.

If we look at the 2011/12 – 2012/13 data we find that the away team won the coin toss an incredible 63.6% of the time. Is there a profit to be found in this market going forward? We attempt to answer this question by testing our query on the holdout set to see if the trend is persistent. Suppose we hypothetically bet on the away team to win the toss throughout the 2013/14 season. What we find is that we make a big loss. Wagering 1 unit on the away team to win the toss at 1.92 odds throughout the 2013/14 season would have returned a -13.88 unit loss on 35 units wagered.

Using a holdout set in this case has prevented us from mistaking a short-term trend for something we can rely on.

How to Create and Use a Holdout Set

STEP 1: sort your historical data chronologically

For betting data this step is typically unnecessary because most historical sports data sets are already provided in chronological order.

STEP 2: decide whether you need to truncate your data

If you are fortunate enough to have a data set that goes back many years, it may not be wise to use all of this information. This is because of the diminishing relevancy of older results. For example, historical results from last year will be highly relevant because many of the players in the squads today were with the same squads last year. The same cannot be said for results 10+ years ago. If your data set goes back more than five years you may want to remove the earliest fixtures from your data.

Even if you’re looking for long-term averages, such as total match scores, older data won’t be as reliable because rules and regulations for sports change and the way the game is played often evolves over time as well.

Common sense and a familiarity with the league you are researching will help you to identify a suitable cutoff point. If you are at a loss for how long to go back, try testing your model using various sample sizes.

STEP 3: decide how large your holdout set should be

There is no strict science to this part. A longer holdout set facilitates a better examination of the betting model, however a longer test set facilities better calibration of the model. What we’re looking for is a happy compromise.

For large data sets I tend to make the holdout set equal to one full season of data. This should provide an adequate number of fixtures to test the model on. If, however, you’re using a data set with not many observations per year – for example the T20 Big Bash league only has 35 fixtures per season – you may want to use more than one season for the holdout set if the entire data set is big enough.

For small data sets I tend to make the holdout set equal to 20%-25% of the entire sample size. This may equal less than one season for the holdout set, but you can’t make the test data set too small either. As is the case with models based on data with small sample sizes, just be sure to treat your results with caution.

STEP 4: split your data into two sets

Because the most relevant information comes from the most recent data, always use the older portion of your data set for the model calibration and the most recent data for your holdout set.

STEP 5: build a model using the first data set

Test and calibrate various betting strategies using the first data set. See if there are any trends to exploit. Try to identify any strategies that would have delivered a profit if they had been consistently applied during the period in question. Calibrate the model/strategy to achieve the maximum profit during the period.

STEP 6: test the model using the holdout set

Once we have a model or hypothesis to test, rather than wait for more fixtures to conclude to conduct the test on – or, heaven forbid, we test the model through actual wagering! – we can test the model immediately using the holdout set. Does the trend that was discovered in the test data persist in the holdout set? If it does then we can be more confident that something worth exploiting was discovered. If not, then it’s back to the drawing board.


Conducting a post-sample evaluation using a holdout set can help you to gain more confidence in a betting strategy, but it still provides no guarantee that the strategy will be profitable in the future.

Looking back at the T20 Big Bash coin toss example, suppose the 2013/14 season had yet to commence. If you had used the 2011/12 season as the test data and the 2012/13 season as the holdout set, you would have found that the away team won the toss 64.5% of the time in 2011/12 and 62.9% of the time in 2012/13. Had you not been able to use common sense to dismiss this result, you may have concluded that the holdout set confirmed your hypothesis that betting on the away team was the way to go. This highlights the fallibility of post-sample evaluation on small sample sizes.


An excellent book on forecasting is:

Forecasting: Methods and Applications
Spyros G. Makridakis, Steven C. Wheelwright, Rob J. Hyndman
December 1997, © 1998
ISBN: 978-0-471-53233-0

Share this:


Post Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.