How Much Data Is the Right Amount?

Two common beliefs come up over and over: "more data is better" and "garbage in, garbage out." Neither of those thoughts is necessarily true. They are general rules of thumb, but it is exceedingly easy to find a plethora of counterexamples.

Let's talk through both of them:

More data = better. If you've ever been to the Kentucky Derby, you know the house hands you a beautifully presented book with statistics on every horse and every jockey.

If all that data were valuable and the bettors could read it, we wouldn't have many surprises, and very few people would make money. Instead, the house provides a lot of information precisely to muddle our decision-making. In short, most of the data it presents is noise, and humans (and computers) are pretty bad at dealing with noisy systems. It's why there are as many stock-picking algorithms as there are stock traders, and why some people make money while others lose. A computer will try to make sense of everything we feed it, and then we get the averages. That's it. If, however, we carefully pick which data to use, and use as little of it as necessary, we can come away with a spectacular result.

More data is better up to a point, and beyond that we revert to biases and averages. In many ways, it's why the 24-hour news cycle has not helped people become better informed, but rather has helped us all confirm our own biases.

Garbage in, garbage out. This is 100% down to data preparation. One can easily take new data and come away with a garbage prediction; in fact, wrong predictions and classifications are much more the norm than the exception. Neural networks and gradient-boosted trees are not inherently better at predicting than a person. They are just able to scale better.

Here's a quick example. I took ten columns of uniform random variables, partially sorted them, and fit them to a noisy sine curve. Then I generated another noisy data set and predicted on it.

Here is the result.

Original target: [figure omitted; a noisy sine curve]

Now, using completely new garbage data, here's the prediction, and it is not garbage out. It's outstanding out.
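The post doesn't include the code, but the experiment can be sketched roughly like this, assuming scikit-learn's GradientBoostingRegressor. The sort fraction, noise levels, and sample size are my guesses, not the post's actual settings:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500

def garbage_features(n):
    # ten columns of uniform noise, each sorted and then partially
    # re-randomized, so every column loosely tracks the row index
    X = rng.uniform(size=(n, 10))
    for j in range(X.shape[1]):
        X[:, j] = np.sort(X[:, j])
        flip = rng.uniform(size=n) < 0.2           # undo ~20% of the sorting
        X[flip, j] = rng.uniform(size=flip.sum())
    return X

t = np.linspace(0, 4 * np.pi, n)
y = np.sin(t) + rng.normal(scale=0.3, size=n)      # noisy sine target

model = GradientBoostingRegressor().fit(garbage_features(n), y)

# predict on a completely fresh draw of the same "garbage"
pred = model.predict(garbage_features(n))
print(np.corrcoef(pred, np.sin(t))[0, 1])
```

Because each partially sorted column encodes the row position, the model effectively learns a position-to-sine mapping, so even a fresh draw of "garbage" yields a clean-looking prediction.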

One more example. Being from Chicago, we love the Bulls, and it's a deeply held belief (pretty much a fact) that the 1995-96 Bulls were the best basketball team ever. However, since the 1990s, the Bulls haven't sniffed an NBA championship. Yet if we predicted this year's NBA champion from historical data alone, we would anticipate the Lakers vs. the Celtics, with the Celtics beating the Bulls in the Eastern semis. That's because these teams have the most championships; historical data may not help us if it reaches too far back.

A good rule of thumb is that we want at least four data points of history for every step forward we wish to predict. Remember, this is a rule of thumb, not a rule set in stone. With stock data, for example, the last three days alone may not be enough to forecast tomorrow. There are other considerations, but we also don't want data from 1947; that probably won't help much either.

So we have two options.

  1. Feel it out and use our intuition (an inferior choice).
  2. Backtest and set some heuristic optimizers to figure out how far back we want to go (a much better option!).
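A toy version of the second option might look like the sketch below. This is entirely my illustration: a naive rolling-mean forecaster stands in for a real model, and the grid search over candidate windows stands in for a fancier heuristic optimizer:

```python
import numpy as np

def best_lookback(series, windows, horizon=1):
    """Backtest each candidate lookback window and return the one
    with the lowest mean squared forecast error."""
    best_w, best_err = None, float("inf")
    for w in windows:
        errs = []
        for t in range(w, len(series) - horizon + 1):
            forecast = series[t - w:t].mean()       # naive mean forecast
            errs.append((forecast - series[t + horizon - 1]) ** 2)
        err = float(np.mean(errs))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# toy "price" series: a random walk, where recent history matters most
rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=300))

w, err = best_lookback(series, windows=[3, 10, 50, 150])
print(w, err)
```

The point is that the backtest, not our intuition, decides how far back to look; swapping in a real forecaster and a real error metric is straightforward.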
