When Capture Higher Ed’s data science team mingles with the company’s sales force, sparks fly … mostly in a good way. During a recent meeting, the conversation turned to the various ways our competitors cheat at enrollment predictions. In other words, how do they make their predictive models look better than they really are?
We came up with four ways — and will highlight each over the next four days. Today, we’ll look at using “Leakers.”
Leakage isn’t just a gross-sounding word. It’s a real prediction problem.
Leakers are variables that contain some element of the outcome. So, if you’re trying to predict whether a student will enroll and you include the enrollment date in your model, you’re going to look like you crushed it.
Look! Your model predicted all these students with an enrollment date were going to enroll … and it was right! Yet obviously, we won’t know the enrollment date when we’re trying to predict next year’s class.
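Here’s a toy sketch of what that looks like in practice. The data and the model are made up for illustration; the point is that a “model” keying off enrollment date scores perfectly on historical records and then predicts zero enrollments for next year’s class, where nobody has an enrollment date yet.

```python
# Hypothetical historical records: enrollment_date is only filled in
# for students who actually enrolled -- that's the leak.
historical = [
    {"name": "A", "enrollment_date": "2023-08-15", "enrolled": 1},
    {"name": "B", "enrollment_date": None,         "enrolled": 0},
    {"name": "C", "enrollment_date": "2023-08-20", "enrolled": 1},
    {"name": "D", "enrollment_date": None,         "enrolled": 0},
]

def leaky_model(student):
    # "Predicts" enrollment from a field that only exists because
    # the student enrolled -- a textbook leaker.
    return 1 if student["enrollment_date"] is not None else 0

accuracy = sum(leaky_model(s) == s["enrolled"] for s in historical) / len(historical)
print(accuracy)  # 1.0 -- flawless on historical data

# Next year's prospects: no one has an enrollment date yet, so the
# model predicts 0 for everyone. Perfect in backtest, useless in production.
prospects = [{"name": "E", "enrollment_date": None},
             {"name": "F", "enrollment_date": None}]
predictions = [leaky_model(s) for s in prospects]
print(predictions)  # [0, 0]
```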
Leakers aren’t always as obvious as enrollment date. In an application model, things as innocuous-seeming as GPA or religion are often leakers.
How can that be possible?
Think about when much of those data are recorded: often it’s after a student applies. A field that only gets filled in after application becomes a proxy variable for application itself, just like including the application date in your predictive model.
Again, if you include these features in your model, you’ll get great accuracy statistics, but your model will fall apart when trying to predict next year’s class.
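One crude way to catch this kind of leaker (our own sketch, not a description of any vendor’s method, and the records below are invented) is to flag any feature whose “is it filled in?” status almost perfectly tracks the outcome. In this toy data, GPA and religion are only keyed in during application processing, so their presence gives the answer away, while ZIP code does not.

```python
def flag_possible_leakers(rows, outcome_key, threshold=0.95):
    """Flag features whose presence/absence nearly matches the outcome.

    A feature recorded only after the outcome event (e.g., GPA entered
    during application processing) will show up here.
    """
    features = [k for k in rows[0] if k != outcome_key]
    flagged = []
    for f in features:
        # How often does "field is filled in" agree with the outcome?
        matches = sum((r[f] is not None) == bool(r[outcome_key]) for r in rows)
        if matches / len(rows) >= threshold:
            flagged.append(f)
    return flagged

# Hypothetical CRM records: gpa and religion are entered post-application.
rows = [
    {"gpa": 3.4,  "religion": "X",  "zip": "40202", "applied": 1},
    {"gpa": None, "religion": None, "zip": "40203", "applied": 0},
    {"gpa": 3.9,  "religion": "Y",  "zip": "40204", "applied": 1},
    {"gpa": None, "religion": None, "zip": None,    "applied": 0},
]
print(flag_possible_leakers(rows, "applied"))  # ['gpa', 'religion']
```

A check like this is only a first pass. It won’t catch subtler leakers, but it is a cheap way to surface fields whose data-entry timing, not their content, is doing the predicting.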
Tomorrow, we’ll look at another way our competitors cheat at prediction: by training and testing their models on the same data.
By John Foster, Data Analyst, Capture Higher Ed