Sandeep Rajput : Professional : Data Science

Big Data and Machine Learning in Enterprise Software (2013-Current)

Since the end of 2012, my focus has been squarely on applying machine learning (including statistical learning) to very large datasets, particularly data based on counts of events over time, which therefore include a great many zeros. In keeping with the ethos of our times, this has meant a much greater focus on open source software, specifically top-level projects in the Apache stack. I have also gone back and written C++ code for common optimization and predictive modeling architectures, to better understand the steps in estimation. The goal on that front, of course, is to parallelize those useful algorithms to the extent possible, and then study the loss in predictive power and the gain in performance while refining them for use on real-life datasets.

I also took some time to do a broad survey of the relevant fields, and found myself interested in functional programming, specifically Clojure (with Incanter), and in newer languages such as Julia. Of course, one also has to be familiar with open source R and with the IPython stack.

Big Data

What makes data big? The first argument offered would be its sheer size. However, a byte of recorded data does not necessarily carry eight bits of valuable information. There is information in data, to be sure, but how much of it is really meaningful?
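One rough way to make the gap between recorded bytes and information concrete is the empirical Shannon entropy of the byte values. The sketch below is only an illustration: the function name `entropy_bits_per_byte` and the sample data are hypothetical, and byte-frequency entropy is a crude order-0 estimate that says nothing about whether the information is valuable.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of the byte values, in bits per byte.

    Hypothetical helper for illustration: it looks only at how often each
    byte value occurs and ignores ordering and structure.
    """
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up count-like data: almost all zeros, as with sparse event counts.
mostly_zeros = bytes([0] * 990 + [1] * 10)

# Made-up data in which every byte value occurs equally often.
uniform = bytes(range(256)) * 4

print(f"mostly zeros: {entropy_bits_per_byte(mostly_zeros):.3f} bits per byte")
print(f"uniform:      {entropy_bits_per_byte(uniform):.3f} bits per byte")
```

Under this measure the zero-heavy sample carries well under one bit per byte, while the uniform sample reaches the full eight. Both occupy storage at one byte per value.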

There are many definitions of what constitutes big data, but none offers a satisfactory explanation. Gartner Research and Google Trends seem to suggest that the hype around Big Data is ending, but interpretation is also big business (i.e., "buy my integrated software package"). There is a nice summary at KDnuggets.