Sandeep Rajput : Predictive Analytics : Tutorials : Common Mistakes
Predictive Analytics : Common Pitfalls
A discussion of common pitfalls when building predictive models: target leak, out-of-time validation, attrition or propensity models, neural nets and reason codes, SVMs and reason codes, anti-discrimination regulation, new laws, and the future availability of a data feed.

The common mistakes I've seen many analysts make are:

1. Not being aware of one's own biases
I have squandered my resistance
For a pocket full of mumbles such are promises
All lies and jests
Still a man hears what he wants to hear
And disregards the rest

-- Lyrics from the song The Boxer performed by Simon & Garfunkel and written by Paul Simon

Nothing destroys credibility like being seen as an instrument for an ideology. Many otherwise smart individuals come into a job thinking they are going to change the world and show everyone else that the established way of doing things is completely misguided. In their zeal, they might overlook some useful details: for example, they might show the performance of their algorithm only on a data set that just happens to confirm their existing bias.

2. Style over meaningful Substance

A major source of heartbreak is putting in long hours when no one asked you to AND expecting a pat on the back. In analytics, because the work is highly specialized, it is quite possible for someone to know a lot about an algorithm as applied to one situation in one industry. While many great leaps have come from cross-discipline creativity, going it alone rarely works, especially if you are inexperienced and unaware of a simple truth about that business and industry. For example, the FCRA (Fair Credit Reporting Act) requires that empirical algorithms used to assess a consumer's creditworthiness also provide the reasons why the consumer was declined more credit. Building a regularized neural network with high predictive power is therefore not simply plug-and-play in such an industry. Beyond the regulations, such models also need to be sold to conventional bankers who, especially in a post-Lehman Brothers world, would balk at black-box approaches.
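To see why a transparent model is easier to explain than a black box, here is a minimal sketch of how reason codes can be read off a linear score. It assumes a simple additive score and a hypothetical "reference" applicant to compare against. The feature names, weights and the `reason_codes` helper are all made up for illustration, and nothing here is a statement of what any regulation actually requires.

```python
def reason_codes(weights, applicant, reference, top_n=2):
    """Return the features that pulled the applicant's score down the most.

    Assumes a linear score: score = sum(weights[f] * value[f]).
    Each feature's contribution is measured relative to a reference
    profile (hypothetical; e.g. a typical approved applicant).
    """
    contributions = {
        name: weights[name] * (applicant[name] - reference[name])
        for name in weights
    }
    # Keep only the features that lowered the score, most harmful first.
    negative = [(name, c) for name, c in contributions.items() if c < 0]
    negative.sort(key=lambda item: item[1])
    return [name for name, _ in negative[:top_n]]


# Hypothetical weights and profiles, for illustration only.
weights = {"utilization": -2.0, "years_on_file": 0.5, "recent_inquiries": -0.8}
applicant = {"utilization": 0.9, "years_on_file": 2.0, "recent_inquiries": 1.0}
reference = {"utilization": 0.3, "years_on_file": 10.0, "recent_inquiries": 1.0}

print(reason_codes(weights, applicant, reference))
```

With a linear score each reason maps to exactly one input, which is what makes it easy to explain. A neural network offers no such direct decomposition, so the reasons have to come from some additional technique.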

3. Target Leak

Unless the system being modeled is purely deterministic, any model, whether statistical or machine learning, is highly unlikely to achieve a classification accuracy of 98% or higher. When it does, there is almost always an error in model design or specification, often known as target leak. Consider the case of customer attrition, where the goal is to forecast the likelihood of a customer having zero spend or closing the account. If a customer who previously spent more than 95% of other customers reduces his spend to 10% of the previous total, there is a high likelihood that he will close the account or simply stop spending. It does not take a model to learn that! A predictor measured that close to the event is little more than a restatement of the target. If a top baseball player has not hit a home run in 60 outings, everyone already knows his stock is down!

Solution: Create a buffer between the observation period (where predictive variables are computed) and the performance period (where the tag, or event, is computed or imputed). Sure, that results in a much less predictive model, but in that setup the model is actually of some practical use.
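The buffer idea can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the window lengths, the half-open date ranges, the "zero spend means attrition" tag and the helper names `split_windows` and `build_row` are all hypothetical choices, not a prescribed design.

```python
from datetime import date, timedelta


def split_windows(snapshot, obs_days=180, buffer_days=30, perf_days=90):
    """Return (obs_start, obs_end, perf_start, perf_end).

    Ranges are half-open: [obs_start, obs_end) and [perf_start, perf_end).
    The buffer is the gap [obs_end, perf_start), which is never used.
    """
    obs_end = snapshot
    obs_start = obs_end - timedelta(days=obs_days)
    perf_start = obs_end + timedelta(days=buffer_days)
    perf_end = perf_start + timedelta(days=perf_days)
    return obs_start, obs_end, perf_start, perf_end


def build_row(transactions, snapshot, **window_kwargs):
    """Build one modeling row from a customer's (date, amount) pairs.

    Predictors come only from the observation window and the tag only
    from the performance window; spend inside the buffer is ignored.
    """
    obs_start, obs_end, perf_start, perf_end = split_windows(
        snapshot, **window_kwargs
    )
    obs_spend = sum(a for d, a in transactions if obs_start <= d < obs_end)
    perf_spend = sum(a for d, a in transactions if perf_start <= d < perf_end)
    # Assumed tag definition: zero spend in the performance window.
    return {"obs_spend": obs_spend, "attrited": int(perf_spend == 0)}


snapshot = date(2021, 6, 1)
transactions = [
    (snapshot - timedelta(days=5), 100.0),  # observation window
    (snapshot + timedelta(days=3), 50.0),   # buffer: ignored on purpose
]
print(build_row(transactions, snapshot, obs_days=30, buffer_days=10, perf_days=20))
```

The transaction that falls in the buffer contributes to neither the predictors nor the tag, so a sharp drop in spend just before the event cannot leak into the model.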