What Statistics Can Learn From Data Science (and Vice Versa)

Since classical statistics emerged around the turn of the 20th century, researchers have used it to analyze data sets. In general, the focus has been on analyzing a sample of the data, then generalizing the findings to the entire population. This was born of necessity: until very recently, the technology to analyze entire data sets containing millions of data points didn’t exist.

Because only a sample of the data is analyzed, statisticians spend a great deal of time ensuring that the assumptions allowing the sample results to be generalized are met. At least, they try to do so. In practice, statistical modelling is often an exercise in seeing how many assumptions you can violate without compromising your analysis. As a result, the methods that get used are not necessarily the most powerful, but the ones that are most robust to these violations.

On the other hand, machine learning models are typically validated by testing model performance against a holdout sample of data points that aren’t used to estimate the model’s parameters. Models that don’t perform well on the holdout sample are discarded. This helps guard against overfitting, because the model will only perform well on the holdout sample if it contains features that have predictive value for the entire population, not just for the data used to fit it.
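As a minimal sketch of holdout validation, the Python below fits a straight line to synthetic data and scores it on points that were not used in fitting. The data, the 25% holdout fraction, and the helper names (fit_line, mse, holdout_split) are assumptions made for illustration, not a prescribed workflow.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mse(xs, ys, a, b):
    """Mean squared prediction error of the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout_split(xs, ys, holdout_fraction=0.25, seed=0):
    """Shuffle the indices, then set aside a fraction as the holdout sample."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_hold = max(1, int(len(idx) * holdout_fraction))
    hold, train = idx[:n_hold], idx[n_hold:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in hold], [ys[i] for i in hold])

# Synthetic data (an assumption): y = 2 + 0.5*x plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

train_x, train_y, hold_x, hold_y = holdout_split(xs, ys)

# Parameters are estimated from the training points only.
a, b = fit_line(train_x, train_y)

train_mse = mse(train_x, train_y, a, b)
holdout_mse = mse(hold_x, hold_y, a, b)
print("training MSE:", round(train_mse, 3))
print("holdout MSE: ", round(holdout_mse, 3))
```

A holdout error far above the training error would suggest that the model has fit noise rather than structure; errors of similar size are what one hopes to see.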

To the extent possible, researchers using classical statistics should also use this type of external validation. This is true even when model assumptions seem to be met, as the data set itself represents just one point in time. When the number of data points is small, cross-validation methods, which test a series of models against a very small holdout sample (often one case), can be used. Since these types of models are often parsimonious, they might also be validated against similar data sets, with the aim of including only those features that maintain their predictive power across each data set tested.
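The one-case holdout described above is usually called leave-one-out cross-validation. The sketch below, in plain Python, compares an intercept-only model with a straight-line model on a ten-point data set. The numbers are made up for illustration, and fit_mean, fit_line, and loocv_mse are hypothetical helper names.

```python
def fit_mean(xs, ys):
    """Intercept-only model: always predicts the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def loocv_mse(xs, ys, fit):
    """Hold out each case in turn, refit on the rest, and average the squared errors."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        errors.append((ys[i] - predict(xs[i])) ** 2)
    return sum(errors) / len(errors)

# Made-up small data set (an assumption for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1, 9.9, 11.2]

mean_score = loocv_mse(xs, ys, fit_mean)
line_score = loocv_mse(xs, ys, fit_line)
print("LOOCV MSE, intercept only:", round(mean_score, 3))
print("LOOCV MSE, straight line: ", round(line_score, 3))
```

Because every case serves as the holdout exactly once, no data are lost to validation, which is what makes the approach attractive when the sample is small.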

In addition, researchers using machine learning should verify that their model assumptions are met. This can be easy to overlook, because the assumptions are often less stringent than in classical statistics. It’s also tempting to treat the model’s performance against the holdout sample as “proof” that it is correct. This is a potentially serious error, as holdout performance may not reveal systematic deviations from the assumptions that exist in both the training and validation data.
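One way to picture this risk is the sketch below, which fits a straight line to synthetic data that actually follow a curve. The training and holdout errors come out similar, so the holdout comparison alone raises no alarm, yet the holdout residuals averaged over three ranges of x show a clear pattern. The data, the every-fourth-point split (chosen here only for reproducibility), and the three bins are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mse(xs, ys, a, b):
    """Mean squared prediction error of the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption): the true relationship is quadratic.
rng = random.Random(7)
xs = [i / 40 for i in range(400)]
ys = [1.0 + 0.3 * x * x + rng.gauss(0, 0.5) for x in xs]

# Every fourth point is held out; the rest are used for fitting.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
hold = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]
train_x, train_y = zip(*train)
hold_x, hold_y = zip(*hold)

# The straight-line model wrongly assumes linearity.
a, b = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, a, b)
holdout_mse = mse(hold_x, hold_y, a, b)

def mean_residual(lo, hi):
    """Average holdout residual for points with lo <= x < hi."""
    res = [y - (a + b * x) for x, y in hold if lo <= x < hi]
    return sum(res) / len(res)

low_mean = mean_residual(0.0, 10 / 3)
mid_mean = mean_residual(10 / 3, 20 / 3)
high_mean = mean_residual(20 / 3, 10.0)

print("training MSE:", round(train_mse, 3), "holdout MSE:", round(holdout_mse, 3))
print("mean holdout residual by x range:",
      round(low_mean, 2), round(mid_mean, 2), round(high_mean, 2))
```

If the linearity assumption held, the three residual averages would all sit near zero. Averages that are positive at both ends and negative in the middle point to curvature that the straight line misses, in the training and holdout data alike.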

The foundations of classical statistics owe much to Sir Ronald Fisher’s agricultural experiments, conducted in the early 20th century. While his methods continue to be invaluable, it’s important to remember that they were designed for a world where the analysis was conducted by hand on a small set of data points. That’s not usually the case today. As a result, it’s critical to draw on the best insights of both classical statistics and contemporary analytics to develop accurate and reliable models for today’s world.