Sandeep Rajput : Predictive Analytics : Tutorials

A very brief review of Predictive Analytics

While all projects have their peculiarities and history, often the same four key steps recur; they are illustrated in the figure below. Though my site uses no JavaScript, this page experiments with MathJax to display math seamlessly with text. There may be some content from the books I've been working on, but of course the level of math is kept friendly for HTML.

[Figure] The image was created using TikZ (a LaTeX package), and then converted from PDF to JPEG using Adobe Photoshop.

1. Data Review

Let us assume that data is available in a form that can be read by the available tools. Data Warehousing is covered well by many technical and non-technical publications and blogs: a simple web search will likely suffice. It is easy enough to figure out how to read a database or to aggregate data when one knows the utility or goal at the end of the data processing step. It could be as mundane as the finance team needing it for forecasting.

A) Provenance

The very first piece of useful information is how many of the values are missing. Many times the data is the result of joining tables spanning several databases, and a missing value has a trail that contains important information. For example, consider the case where a customer has never purchased a product. Typical processing steps will mark the relevant column as NULL. That is not necessarily the same as zero. If the product is Windows 8 and the last date on which the customer purchased anything was August 13, 2004, that is a completely different situation.
Consider a more realistic case where we join multiple tables based on a customer's unique ID. Let's say we begin with the customer information, and then join records containing purchases that meet certain criteria. As the next step we join the returns. A customer who did not purchase the item of course could not have returned it. That is different from the case where the customer could have returned it but did not do so. Such distinctions are critical in approaching the problem. Practitioners, particularly with SAS, use special missing values (SAS distinguishes .A through .Z in addition to the ordinary missing value) to encode the different reasons a value can be missing.
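The distinction can be sketched with a toy join; every table name, column name, and value below is hypothetical, chosen only to mirror the purchase/return scenario above.

```python
import pandas as pd

# Toy tables: all names and values are made up for illustration.
customers = pd.DataFrame({"cust_id": [1, 2, 3]})
purchases = pd.DataFrame({"cust_id": [1, 2], "n_purchases": [5, 2]})
returns = pd.DataFrame({"cust_id": [2], "n_returns": [1]})

# Left joins keep every customer; unmatched rows become NaN, not zero.
df = customers.merge(purchases, on="cust_id", how="left")
df = df.merge(returns, on="cust_id", how="left")

# Customer 3 never purchased, so a return was impossible;
# customer 1 purchased but chose not to return. Both show NaN in
# n_returns, and only the purchase column lets us tell them apart.
could_not_return = df["n_purchases"].isna()
did_not_return = df["n_purchases"].notna() & df["n_returns"].isna()
```

Collapsing both NaN cases to a single zero would erase exactly the provenance trail the text warns about.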

B) Summarizing information

Consider the case where we wish to learn more about a variable or field of interest. For ordinal or nominal cases, the preliminary approach is obvious. It is for continuous variables that one has to be careful. With the great leaps in computation, a practical solution is to provide some sort of summary: a few measures of location and spread, along with a look at the distribution itself.
For illustration, we use the margins of victory in NCAA Men's March Madness historical results from 1985 to 2012 (summarized in R). Of course, the margin of victory is at least one. The average is 11.6195 but the median is 10. That tells us the data is skewed a bit to the right. That's expected because the margin is bounded on the lower end at 1. The standard deviation computes to 8.86. Of course the distribution is not Gaussian at all; in fact it is not even symmetric. If we plot the histogram, we see that many games are close, which is no surprise. However, there seems to be a separate clump around 14 and another around 17, and perhaps one at 9. Relatively mismatched teams will likely produce some very large margins, or maybe the margins become tighter after the first week? Has the pattern been changing over the years? Are there more teams in the draw, or do they otherwise play more games? All these questions are possible hypotheses which the analyst or data scientist can weigh with available information.
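Such summaries are one-liners in most environments. A sketch in Python, using made-up margins since the tournament data itself is not reproduced here:

```python
import statistics as st

# Hypothetical margins of victory; the real 1985-2012 data is not shown here.
margins = [1, 2, 3, 5, 7, 10, 10, 12, 14, 17, 23, 35]

mean = st.mean(margins)      # about 11.58 for this sample
median = st.median(margins)  # 10
sd = st.pstdev(margins)

# Mean exceeding the median is the signature of right skew,
# just as with the tournament margins discussed above.
assert mean > median
```

The same comparison of mean against median is what flagged the skew in the tournament data.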

[Figure: histogram of the margins of victory.]

Data binning can be thought of as grouping continuous values into a manageable number of discrete buckets. Univariate frequencies are almost trivial to compute but can be insightful.

C) The "sniff" test | ||

To some extent the above information is useful. How does that change if 94.3% of the values are zero? Because the zeros dominate, an overall average of 24.91 corresponds to an average of 436.9 among the non-zero values alone. A measure of spread for the non-zero values is only a little more involved, and doable in this case, computing to 44.34 versus 101.86 overall. That is a huge change. The coefficient of variation (defined as the ratio of standard deviation to average) \(c=\sigma/\mu\) moves from a large value of 101.86/24.91 = 4.09 to a much tighter 44.34/436.9 = 0.10! To some it appears like magic, but in reality it is grade-school math and analytical thinking. While the first two moments are easy to calculate, they are most meaningful when the distribution of values is unimodal and roughly symmetric, which brings us to the shape of the distribution.
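The effect can be replayed on synthetic data; the zero fraction and the non-zero mean below are chosen to mimic, not reproduce, the numbers above.

```python
import random
import statistics as st

random.seed(7)
# Zero-inflated sample: ~94% zeros, non-zeros clustered near 437.
values = [0.0] * 940 + [random.gauss(437, 44) for _ in range(60)]
nonzero = [v for v in values if v != 0]

cv_all = st.pstdev(values) / st.mean(values)        # large: zeros dominate
cv_nonzero = st.pstdev(nonzero) / st.mean(nonzero)  # small: tight cluster
```

Separating the zero mass from the non-zero values turns an apparently wild variable into a well-behaved one, which is the whole point of the sniff test.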

D) Shape of measurements

Consider the above case with a further modification. For the same average and the same standard deviation, the general spread of values could take quite different shapes, which is why one must look beyond the first two moments.

D1) Discrete measurements

Many measurements in big data are counts, for which the Poisson distribution (whose mean equals its variance) is a natural first candidate. What if the distribution has a higher variance than mean? Such overdispersed counts are often better described by the Negative Binomial distribution. While its original definition arose from Bernoulli trials (only two outcomes, typically labeled 0 and 1), one can simply interpret failure as the non-occurrence of the event being counted. This is by no means a comprehensive list of discrete distributions one should consider. In fact, before considering a distribution one would do well to apply logic. If one column of values (appearing to be an integer or a long integer) has many unique values and each value appears once or twice, there is a good chance it is a hash code. If it only seems to take one value, perhaps more data needs to be analyzed: it might just be the hour of the day, and very often transactions in a database are recorded in (almost) exact temporal order. The caveat is for real-time and parallel processing systems.
- **One or two unique values:** perhaps the hour of day, or hashed or anonymized personally identifying information (PII) such as social security number (SSN), date of birth (DOB), or phone number. The number of digits offers strong hints. If you think you're seeing some PII, the best course of action is to throw a security cordon around the data and limit access to the files or directories containing PII, unless it is your job to deal with PII data and you have all the clearances needed.
- **Many unique values with similarly low counts:** probably a primary key or ID, store number, or zip code (esp. zip+4).
- **Central tendency:** if the values look like counts, the mean and variance are similar, and the distribution is symmetric, the Poisson distribution is advised. For overdispersed cases, consider the Negative Binomial distribution.
- **More than one peak:** possibly a mixture of two or more distributions. That is an advanced topic, but looking for this pattern can be quite instructive.
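The central-tendency check above boils down to comparing mean and variance. A quick sketch with hypothetical counts:

```python
import statistics as st

# Hypothetical count data, made up for illustration.
counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]
m = st.mean(counts)
v = st.pvariance(counts)

# A dispersion ratio near 1 is consistent with Poisson;
# well above 1 points toward the Negative Binomial.
dispersion = v / m
```

For this sample the ratio is close to 1, so Poisson would be a reasonable first model; a ratio of, say, 3 would argue for the Negative Binomial instead.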

D2) Continuous measurements

In the digital age, it is hard to rigorously define a continuous measurement, since everything stored digitally is quantized to some finite precision; in practice we treat a variable as continuous when it can take many distinct ordered values. Consider what continuous measurements arise from. In the simplest case, one has a direct reading such as a temperature or a weight. In many cases, continuous measurements are derived quantities:
- **Velocity:** number of items purchased or visits in a time unit. This is also known as a *rate*. Someone who visited a supermarket 43 times in the past 180 days has a *rate* of 43/180 = 0.2389.
- **Fractions:** fraction of visits to a web site where any link in the **Trending** box was clicked. As one can see, fractions are limited to a range of [0,1]. In some cases they appear as percentages, but are easy to make out. Fractions are tricky to deal with, and we shall have more to say on that topic in a moment.
- **Acceleration:** if a user had a velocity of 0.2389 in the first half of the year and 0.2944 in the second half, then the *acceleration* is (0.2944-0.2389)/(0.2389) = 0.2326. In other words, the rate at which the supermarket was visited grew by 23.26%. If the calculation is performed using observed values (i.e. 53 visits vs. 43 visits), precision is retained; however, if such ratios are taken on what are effectively ratios themselves, the general variability is enhanced considerably. Acceleration is *also* a ratio.
- **Ratio:** a ratio is more general than a fraction or acceleration. If one considers the ratio of the visits to a portal this month to the visits by the same person 2 months ago, several possibilities arise. Since the numerator and denominator are disjoint (i.e. have no overlap), both can be zero.
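The velocity and acceleration arithmetic above, spelled out (the counts are the ones from the supermarket example):

```python
# 43 visits in the first half-year, 53 in the second, 180 days each.
visits_h1, visits_h2, days = 43, 53, 180

rate_h1 = visits_h1 / days  # 0.2389 visits per day
rate_h2 = visits_h2 / days  # 0.2944 visits per day

# Computing from the raw counts (10/43) retains precision; dividing
# already-rounded rates would compound the rounding error.
acceleration = (visits_h2 - visits_h1) / visits_h1  # 0.2326, i.e. 23.26%
```

Note that 0.2326 comes from the exact counts; chaining the rounded rates would give 0.2323 instead, illustrating the ratio-of-ratios caveat.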
When the denominator is zero, a ratio is undefined. However, 0/0 is different from 3/0 in this case. The first case probably tells us about a very infrequent user, or perhaps the timescale we are looking to model is much larger than the one we sample at. As for the latter case, timescales are probably not as much of an issue: consideration needs to be given to the business and process one is analyzing. For ratios, plotting the logarithm is often revealing. One has to take care to add a small \(\varepsilon\) to the observed value before taking the logarithm, since \(\ln(0) = -\infty\). For fractions, the logit transform \(\ln\!\left(\frac{p}{1-p}\right)\) plays a similar role, with the same \(\varepsilon\) caveat at both endpoints of [0,1].
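A sketch of the epsilon-shifted log transform; the value of \(\varepsilon\) here is an arbitrary choice and should be picked relative to the scale of the data.

```python
import math

eps = 1e-6  # arbitrary small shift, an assumption of this sketch
ratios = [0.0, 0.5, 1.0, 3.0, 12.0]

# log(0) is undefined, so shift every value by eps before transforming.
log_ratios = [math.log(r + eps) for r in ratios]
```

After the shift, zeros map to a large negative value rather than \(-\infty\), so they remain visible as a distinct clump at the left edge of a histogram instead of breaking the plot.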