Why Do Models Underperform?

Nov 26, 2018

Your analytics team spent months learning the business, defining the problem, and developing the data. Another few weeks and the models were trained. All indications were that your new models and analytics capabilities were going to usher in a new phase for your organization. Fast forward six months and there you are, wondering why these models are not working as projected. What happened? This thing must be broken, right? The short answer is: probably not. A defect in the transition from design to test to production is always possible, but more often an “underperforming” model has subtler root causes. These causes can exist at every phase of the project and, sometimes, are even out of our control.

As data scientists, we have more resources and computing power than ever before. Novel approaches and complex algorithms, once confined to the pages of academic journals, are increasingly at our fingertips. With this has come an explosion in the number of trials, algorithms, iterations, and sampling approaches analysts use. Far too often this leads to the traditional overfitting pitfalls we strive so hard to avoid, albeit in a different form. Improper use of holdout and test samples is an increasingly common cause of overfitting: analysts choose parameter settings and model structures based on how they perform in these partitions. The result is a model overfit not to the training data but to the test data. When we use these results to set expectations for future performance, we are often left “underperforming.”
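To make the test-set overfitting problem concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption: the labels are coin flips, so no candidate can truly beat 50% accuracy, and each "model" simply predicts the value of one random feature. Picking the best of many candidates on the test partition still produces a flattering score that does not hold up on untouched data.

```python
import random

def make_rows(n_rows, n_features, rng):
    """Random binary features with a coin-flip label: there is no real signal."""
    return [
        ([rng.getrandbits(1) for _ in range(n_features)], rng.getrandbits(1))
        for _ in range(n_rows)
    ]

def accuracy(feature_index, rows):
    """Accuracy of the toy 'model' that predicts the value of a single feature."""
    hits = sum(1 for features, label in rows if features[feature_index] == label)
    return hits / len(rows)

def select_on_test(n_candidates=200, n_test=100, n_fresh=2000, seed=0):
    """Pick the best candidate on the test partition, then re-score it on fresh data."""
    rng = random.Random(seed)
    test_rows = make_rows(n_test, n_candidates, rng)
    fresh_rows = make_rows(n_fresh, n_candidates, rng)
    # Model selection driven by the test partition: the practice to avoid.
    best = max(range(n_candidates), key=lambda j: accuracy(j, test_rows))
    return accuracy(best, test_rows), accuracy(best, fresh_rows)

test_score, fresh_score = select_on_test()
print(f"score on the partition used for selection: {test_score:.3f}")
print(f"score on untouched data:                   {fresh_score:.3f}")
```

The gap between the two printed numbers is pure selection bias. Keeping one partition completely out of the selection loop, and quoting only that number to stakeholders, is one way to avoid promising the inflated figure.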

When we set expectations for our models, we often rely on historical data as the inputs into our future projections. We use some previous population to estimate what our solutions will do moving forward – inevitably binding our success to these values. In situations where the future does not mirror the past, our models may not perform as we expect across all metrics. For example, if the event that we are trying to capture (a fraudulent transaction, an on-time payment, a response to a treatment, etc.) becomes scarcer, metrics that depend on the event rate, such as precision, will – by definition – fall short of our expectations. In these situations, rather than providing stakeholders with a point estimate of a single metric, it is best to provide a range of possible values as well as additional (more stable) performance metrics to monitor.
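The effect of a scarcer event can be worked through with a little arithmetic. The sketch below assumes, purely for illustration, a classifier whose sensitivity (0.80) and specificity (0.95) stay fixed while the event rate changes; those numbers and the function name `precision_at` are hypothetical. Under that assumption recall is unchanged by construction, while precision moves with the event rate, which is why a range across plausible rates is more honest than a single point estimate.

```python
def precision_at(sensitivity, specificity, prevalence):
    """Precision implied by fixed sensitivity/specificity at a given event rate."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Hypothetical operating point and a range of plausible future event rates.
SENSITIVITY, SPECIFICITY = 0.80, 0.95
scenarios = [0.05, 0.03, 0.02, 0.01]

precision_by_rate = {
    rate: precision_at(SENSITIVITY, SPECIFICITY, rate) for rate in scenarios
}

for rate, value in precision_by_rate.items():
    # Recall equals sensitivity, so it stays at 0.80 in every scenario.
    print(f"event rate {rate:.2%}: precision {value:.3f}, recall {SENSITIVITY:.3f}")

low, high = min(precision_by_rate.values()), max(precision_by_rate.values())
print(f"precision range to communicate: {low:.3f} to {high:.3f}")
```

In this toy setup the model itself has not changed at all; only the population has. Reporting recall alongside the precision range gives stakeholders one number that should hold steady and one that is expected to move.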

Lastly, we often rely on this same historical data as the inputs in the model fitting process. If unplanned or unknown changes in the “shape” of that data are not accounted for, we can expect to see a different and likely “underperforming” model in production. For example, the Tax Cuts and Jobs Act of 2017 has created a situation where analysts can expect taxpayer behavior, and the underlying data, to change “shape” in many ways.

Any model, business rule, or other data-driven process going into production for 2019 that does not account for these changes now is almost certain to “underperform” next filing season.
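One way to watch for a change in the “shape” of an input is to compare its distribution in the training data with what arrives in production. The sketch below uses the Population Stability Index, a common drift measure chosen here as an example rather than anything this article prescribes. The synthetic data, the bin count, and the function name `psi` are all assumptions for illustration.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of `expected`; values of `actual`
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(values), eps) for count in counts]

    expected_shares, actual_shares = shares(expected), shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )

rng = random.Random(1)
training = [rng.gauss(50, 10) for _ in range(5000)]         # data the model was fit on
same_shape = [rng.gauss(50, 10) for _ in range(5000)]       # production, no change
changed_shape = [rng.gauss(58, 14) for _ in range(5000)]    # production after a shift

psi_same = psi(training, same_shape)
psi_changed = psi(training, changed_shape)
print(f"PSI, unchanged population: {psi_same:.4f}")
print(f"PSI, shifted population:   {psi_changed:.4f}")
```

A value near zero suggests the production data still looks like the training data, while a much larger value is a prompt to revisit the model before trusting its projections. The level that should trigger a review is a judgment call for each team.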