Welcome to Part 2 of a four-part series we’re calling “How To Cheat at Enrollment Predictions.” So, your predictive model isn’t as accurate as Capture Higher Ed’s Envision model, but you don’t want to admit it. If that’s the case, take these tips from some of our competitors to make a predictive model that looks great on paper but fails in the real world.
In Part 1, we looked at the dubious use of “Leakers,” which we know is an unfortunate term. Sorry. Today, we examine a trick that involves training and testing your model on the same data.
Predictive models, especially the next-generation models employed in Envision, test hundreds or thousands of variables in combinations totaling one with a whole bunch of zeroes off to the right. Doing this, they pick up on subtle trends in the data and can explain that data very well.
How can that be a problem? Well, sometimes they fit that data a little too well, a problem called overfitting.
Dr. David Leinweber intentionally set out to overfit a model to explain the S&P 500. He combined these variables:
- Butter production in Bangladesh;
- Butter production in the U.S.;
- Cheese production in the U.S.;
- The sheep population in Bangladesh;
- The sheep population in the U.S.
Using this model, he was able to explain 99 percent of the variance in the S&P 500 (an R-squared of .99). But don't go risking your 401(k) just yet. Outside of the 1983 to 1993 window, his model starts to smell like butter, cheese and sheep that have been sitting out in the sun too long.
Another example is charting the U.S. population growth over time. A linear model (straight line) is pretty good. A quadratic model (a little more complicated) is better. A quartic model (more complicated still) fits even closer to the historic trend.
The only problem is it predicts the U.S. population to drop to zero around 2050. Let’s hope that’s overfit.
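You can see the first half of that effect in a few lines of Python. The population figures below are rough, made-up approximations of the historical trend, not official census data, but they show why a more complicated model always looks better on the data it was built on:

```python
import numpy as np

# Rough, illustrative U.S. population figures in millions -- approximations
# for demonstration, not official census data.
years = np.array([1950, 1960, 1970, 1980, 1990, 2000, 2010], dtype=float)
pop = np.array([152, 181, 205, 227, 249, 282, 309], dtype=float)
x = years - 1980.0  # center the years so the polynomial fit is well-conditioned

def rms_training_error(degree):
    """Fit a polynomial of the given degree and return its in-sample RMS error."""
    coeffs = np.polyfit(x, pop, degree)
    return np.sqrt(np.mean((np.polyval(coeffs, x) - pop) ** 2))

# Each step up in complexity fits the historical data at least as closely...
print(rms_training_error(1))  # linear
print(rms_training_error(2))  # quadratic
print(rms_training_error(4))  # quartic -- smallest in-sample error of the three

# ...but that closer fit says nothing about how the quartic behaves
# outside 1950-2010, which is exactly where overfit models fall apart.
```

The in-sample error can only shrink as the degree rises, because each simpler model is a special case of the more complicated one. That is why training-set accuracy alone is such a flattering, and misleading, scorecard.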
The way to protect against this sort of overfitting is to hold out a portion of the data that the model doesn’t get to see as it’s running through its algorithms. Then test your model against that data — predicting data the model hasn’t seen, just like it will have to do in the real world. Judge your model by how well it explains that holdout data, not how well it explains the data it was built on.
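That holdout procedure looks roughly like this. The tooling (scikit-learn), the synthetic data and the simple logistic model are our own stand-ins for illustration, not the workings of any particular enrollment model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 1,000 "applicants" with 5 made-up features,
# and an enrolled/not-enrolled label driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# Hold out 30 percent of the data that the model never sees while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Judge the model on the holdout set, not on its own training data.
print("training accuracy:", model.score(X_train, y_train))
print("holdout accuracy:", model.score(X_test, y_test))
```

The holdout score is the honest one: it measures the model on records it never saw, which is the only job it will have once next year's applicants start arriving.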
But if you want to overfit your data and admire those gaudy accuracy statistics, by all means, train and test your model on the same data set.
Tomorrow, we’ll continue our “how to” series on cheating at predictions with Part 3, “Predicting Close to the Target Date.”
By John Foster, Data Analyst, Capture Higher Ed