A central problem in machine learning is over-fitting the model to the data. This is especially true of models that are grown step by step, such as decision trees, where each additional split can only improve, never worsen, the model’s fit to the training data. Ensemble learning models, such as the random forest, guard against over-fitting by estimating hundreds of models, then aggregating their results. The idea is that each model in the ensemble will contain slightly different information, so that aggregating them separates more of the “signal” from the “noise” of the individual models.
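The averaging idea can be illustrated with a minimal sketch. It is not a random forest: it assumes each "model" is simply the true value plus its own independent Gaussian error, and the names `noisy_model` and `ensemble_prediction`, along with all the constants, are hypothetical. Real trees in a forest are correlated, so the improvement in practice is smaller than this idealized case suggests.

```python
import random
import statistics

def noisy_model(true_value, noise_sd, rng):
    """One hypothetical 'model': the true signal plus its own random error."""
    return true_value + rng.gauss(0, noise_sd)

def ensemble_prediction(true_value, noise_sd, n_models, rng):
    """Average the predictions of n_models independent noisy models."""
    return statistics.fmean(
        noisy_model(true_value, noise_sd, rng) for _ in range(n_models)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_VALUE, NOISE_SD, N_MODELS, TRIALS = 10.0, 2.0, 100, 2000  # assumed values

# Absolute error of a single model versus an ensemble of 100 models.
single_errors = [
    abs(noisy_model(TRUE_VALUE, NOISE_SD, rng) - TRUE_VALUE)
    for _ in range(TRIALS)
]
ensemble_errors = [
    abs(ensemble_prediction(TRUE_VALUE, NOISE_SD, N_MODELS, rng) - TRUE_VALUE)
    for _ in range(TRIALS)
]

mean_single_error = statistics.fmean(single_errors)
mean_ensemble_error = statistics.fmean(ensemble_errors)
print(f"single model:  {mean_single_error:.3f}")
print(f"ensemble mean: {mean_ensemble_error:.3f}")
```

Under these assumptions the ensemble's typical error should shrink by roughly the square root of the number of models, which is why averaging hundreds of noisy estimates can beat any one of them.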
It seems counter-intuitive that adding randomness to a model would increase its accuracy. In ordinary usage, saying an outcome is random usually means that it is uncertain or unpredictable, like the spin of a roulette wheel. However, while it’s generally impossible to predict the outcome of a single random trial, the combined outcome of a large number of random trials can be predicted to a high degree of accuracy.
This is the principle underlying auto insurance: the insurance company can produce a close estimate of the number of auto accidents each day, even if it doesn’t know precisely which vehicles will be involved. This is because, under some very general conditions, the sum of a large number of independent random variables is approximately normally distributed, with a spread that is small relative to the total. And the total number of accidents is all the insurance company really needs to know.
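A small simulation makes the insurance argument concrete. This is only a sketch under stated assumptions: the number of drivers, the daily accident probability, and the function name `accidents_in_one_day` are all made up for illustration, and every driver is assumed to be independent of every other.

```python
import random
import statistics

def accidents_in_one_day(n_drivers, p_accident, rng):
    """Count accidents when each driver independently crashes with probability p_accident."""
    return sum(1 for _ in range(n_drivers) if rng.random() < p_accident)

rng = random.Random(42)  # fixed seed so the run is repeatable
N_DRIVERS, P_ACCIDENT, N_DAYS = 5000, 0.01, 500  # assumed, illustrative values

daily_totals = [
    accidents_in_one_day(N_DRIVERS, P_ACCIDENT, rng) for _ in range(N_DAYS)
]

mean_total = statistics.fmean(daily_totals)
sd_total = statistics.pstdev(daily_totals)

# For a bell-shaped distribution, roughly two thirds of days
# should fall within one standard deviation of the mean.
within_one_sd = sum(
    1 for total in daily_totals if abs(total - mean_total) <= sd_total
) / N_DAYS

print(f"mean accidents per day: {mean_total:.1f}")
print(f"standard deviation:     {sd_total:.1f}")
print(f"share within one sd:    {within_one_sd:.2f}")
```

No single driver's day can be predicted, yet the daily total should cluster tightly around the expected count of 50 (5000 drivers times 0.01).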
This property of random variables, formally known as the Central Limit Theorem, has other interesting implications. Most people have heard of the normal “bell curve” that scientists use to model everything from stock prices to strike-outs. The curve is also called the Gaussian distribution, after the mathematician and astronomer Carl Friedrich Gauss, who used it in the early nineteenth century to model errors in astronomical measurements. Such errors have many small sources, among them the scattering of light by stray particles in the Earth’s atmosphere. The reasoning is that each small disturbance is a random event, and that the total error is the sum of these random events. Thus, the normal distribution models the effect of a large number of random, independent events on the measured variable.
The same justification is given in social science. For example, it seems reasonable that someone’s income would be influenced by a large number of variables, such as how much education they have, what type of job they have, and where they live. Therefore, social scientists often model a variable like income using the normal distribution, with each explanatory variable contributing a partial effect to the final measurement. Just as the combined effect of numerous particle interactions is normally distributed, so is the combined influence of the explanatory variables.
You can even see the bell curve at work on the game show “The Price is Right”: the game “Plinko” has an inclined game board embedded with several dozen metal pegs. Contestants drop a Plinko disc from the top of the board, and it rattles down through the pegs before landing in one of the money slots at the bottom of the board. The path each disc takes is essentially random, so its destination is the combined effect of its impacts with every peg it hits. Drop enough discs from the same starting point, and their landing spots trace out a rough bell shape centered below the drop point.
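The Plinko mechanism is easy to simulate. The sketch below is an idealization, not the actual game board: it assumes the disc meets one peg per row, bounces left or right with equal probability each time, and never touches a side wall. The row count, disc count, and the name `drop_disc` are hypothetical.

```python
import random
from collections import Counter

def drop_disc(n_rows, rng):
    """Return the final slot: the number of rightward bounces over n_rows pegs."""
    return sum(rng.choice((0, 1)) for _ in range(n_rows))

rng = random.Random(7)  # fixed seed so the run is repeatable
N_ROWS, N_DISCS = 12, 10000  # assumed board size and number of drops

# Tally how many discs land in each of the N_ROWS + 1 slots.
slots = Counter(drop_disc(N_ROWS, rng) for _ in range(N_DISCS))

# Print a sideways histogram; each '#' stands for 50 discs.
for slot in range(N_ROWS + 1):
    print(f"{slot:2d} {'#' * (slots[slot] // 50)}")
```

Each bounce is a coin flip, but the histogram of final slots should bulge in the middle and thin out toward the edges, which is the bell curve emerging from a sum of small random events.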
These relationships have not gone unnoticed: here, a stock market advisory models stock returns using a physical “bean machine” that has the same operating mechanism as Plinko. The implications of this for one’s 401(k) deserve careful attention.