Your cross-validation score is probably too high

You spent a week building a model on historical data. The cross-validation numbers look solid. You hand it off, and within a few months the live results are nowhere near what the validation promised. At this point most people blame the model. The more likely culprit is the score itself.

If your data has timestamps and neighbouring rows depend on each other, the standard cross-validation setup gives you inflated numbers that cannot be trusted. Here is why.

The assumption underneath k-fold

Standard k-fold works by slicing your dataset into folds, rotating which fold gets held out as the test set, and averaging the scores across rounds. There is a quiet assumption baked into this: the rows do not depend on each other. Pick any two at random, and knowing one tells you nothing about the other.

That holds for survey responses or medical records where each row captures an independent subject. It breaks for almost anything measured over time, because consecutive rows share information.

What actually happens with time-series data

Think about a feature that looks back three days to build a signal. Two rows recorded one day apart share two of those three days of input. Their targets often overlap too, if the outcome you are predicting takes more than a moment to resolve.
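To put a number on that overlap, here is a small sketch. It assumes the feature for a given day is built from that day and the two days before it; the helper names `lookback_days` and `shared_fraction` are made up for illustration.

```python
def lookback_days(day, window=3):
    """Days of raw input behind the feature for `day`.

    Assumption: the window covers the day itself plus the window - 1 days before it.
    """
    return set(range(day - window + 1, day + 1))


def shared_fraction(gap, window=3):
    """Fraction of input days shared by two rows recorded `gap` days apart."""
    first = lookback_days(100, window)
    second = lookback_days(100 + gap, window)
    return len(first & second) / window


# Print the shared fraction for rows 0, 1, 2 and 3 days apart.
for gap in range(4):
    print(gap, shared_fraction(gap))
```

Rows stop sharing input only once the gap between them is at least as long as the lookback window. The same reasoning applies to targets that take several days to resolve.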

When k-fold drops one of those rows into training and the neighbouring row into the test set, the model is not predicting an unseen future. It is recognising data it has already handled. The score goes up, but only because the wall between train and test has a hole in it.
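How often does that happen? The sketch below counts, for each fold, the share of test rows that have an adjacent row (one step earlier or later) sitting in the training set. It uses scikit-learn's `KFold`; the function name `share_with_neighbour_in_train` and the row count are illustrative choices, not part of any library.

```python
import numpy as np
from sklearn.model_selection import KFold


def share_with_neighbour_in_train(cv, n_rows=1000):
    """Average share of test rows whose previous or next row is in training."""
    rows = np.arange(n_rows)
    shares = []
    for train_idx, test_idx in cv.split(rows):
        # Padded mask: row i lives at position i + 1, so that the
        # neighbours of the first and last row always exist.
        in_train = np.zeros(n_rows + 2, dtype=bool)
        in_train[train_idx + 1] = True
        previous_in_train = in_train[test_idx]      # row t - 1
        next_in_train = in_train[test_idx + 2]      # row t + 1
        shares.append(np.mean(previous_in_train | next_in_train))
    return float(np.mean(shares))


shuffled = share_with_neighbour_in_train(KFold(n_splits=5, shuffle=True, random_state=0))
blocked = share_with_neighbour_in_train(KFold(n_splits=5, shuffle=False))
print(f"shuffled folds:   {shuffled:.3f}")
print(f"contiguous folds: {blocked:.3f}")
```

With shuffled folds nearly every test row has a neighbour in training. With contiguous folds only the rows at the edges of each test block do, and those edges are exactly where the remaining leakage sits.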

You can make this concrete by building a target out of pure noise, with label periods that overlap from one row to the next, and running a standard random forest through shuffled k-fold. The reported score will often look respectable. Replace the shuffled split with one that respects time order and blocks overlapping rows, and the same model collapses to chance, which is the correct answer for noise.
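A minimal sketch of that experiment follows. Everything about the setup is an assumption chosen for illustration: the features are five independent random walks, the target is the sum of the next 40 values of a separate noise series, and the time-ordered split is scikit-learn's `TimeSeriesSplit` with its `gap` parameter rather than a dedicated purged splitter. Exact scores depend on the seed and on these choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score


def compare_splits(n_rows=1500, horizon=40, seed=0):
    """Return mean R^2 under shuffled k-fold and under a time-ordered split."""
    rng = np.random.default_rng(seed)

    # Features: five independent random walks. They carry no information
    # about the target, but neighbouring rows look almost identical.
    X = rng.normal(size=(n_rows, 5)).cumsum(axis=0)

    # Target: sum of the next `horizon` values of a separate noise series,
    # so the labels of nearby rows overlap heavily.
    noise = rng.normal(size=n_rows + horizon)
    running = np.concatenate([[0.0], noise.cumsum()])
    y = running[horizon + 1 : horizon + 1 + n_rows] - running[1 : n_rows + 1]

    model = RandomForestRegressor(n_estimators=50, random_state=seed)

    shuffled = KFold(n_splits=5, shuffle=True, random_state=seed)
    # Train only on the past, and leave a gap of `horizon` rows so that
    # no training label reaches into the test window.
    ordered = TimeSeriesSplit(n_splits=5, gap=horizon)

    leaky = cross_val_score(model, X, y, cv=shuffled, scoring="r2").mean()
    honest = cross_val_score(model, X, y, cv=ordered, scoring="r2").mean()
    return leaky, honest


leaky_r2, honest_r2 = compare_splits()
print(f"shuffled k-fold R^2:    {leaky_r2:.2f}")
print(f"time-ordered split R^2: {honest_r2:.2f}")
```

Nothing in the features can predict the target, so any clearly positive score from the shuffled split comes from neighbouring rows landing on both sides of the split.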

Closing the gap

The fix is not about switching algorithms. It is about adjusting the split.

Two small rules do most of the work:

  • Purging. Before evaluating on a test window, remove any training rows whose label period reaches into that window. The model should not have handled any outcome that belongs to the test period.
  • Embargo. Add a short gap between the end of the test window and the next training block. Rows immediately after a test period are still correlated with it, and including them quietly re-introduces leakage through the back door.

Together, purging and embargo rebuild the wall. The model now has to find real structure in the data to get a good score rather than borrow it from nearby rows.
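The two rules can be sketched in a few lines of NumPy. This is an illustrative generator, not the API of any particular library: the name `purged_embargo_splits` and its parameters are hypothetical, and it assumes the label for row t is resolved using rows t through t + `label_horizon`.

```python
import numpy as np


def purged_embargo_splits(n_samples, n_splits, label_horizon, embargo):
    """Yield (train_idx, test_idx) pairs for contiguous test windows.

    Assumption: the label of row t depends on rows t .. t + label_horizon.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        test_start, test_end = test_idx[0], test_idx[-1]

        # Purging: keep earlier rows only if their label period
        # ends before the test window starts.
        before = indices[indices + label_horizon < test_start]

        # The last test label is not resolved until test_end + label_horizon,
        # so later rows inside that span are dropped as well. The embargo
        # then adds a further gap on top of it.
        after = indices[indices > test_end + label_horizon + embargo]

        yield np.concatenate([before, after]), test_idx


for train_idx, test_idx in purged_embargo_splits(20, 4, label_horizon=2, embargo=1):
    print("test:", test_idx.tolist(), "train:", train_idx.tolist())
```

scikit-learn's `cv` argument accepts an iterable of (train, test) index arrays, so the output of a generator like this can be passed to `cross_val_score` as `cv=list(purged_embargo_splits(...))`.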

A Python library called purgedcv implements both. The source is on GitHub. It slots into scikit-learn wherever you currently use KFold, so switching is mostly a one-line change. It also supports combinatorial splits that produce several distinct backtest paths from the same dataset, which gives you a much more honest picture of whether your model generalises or just got lucky on one particular slice of history.

What to take away

A high cross-validation score on time-series data is a question, not an answer. The first thing to check is whether the split could see around corners. Tighten it with purging and an embargo, and the score that survives is one you can actually build on. A lower honest number beats a higher one that falls apart in production.
