Data Science – You’re Doing It Wrong!

A lot of articles have been written about data science being the greatest thing since sliced bread (including this one by me). Data products are the driving force behind new multi-billion-dollar companies, and many of the things we do on a day-to-day basis have machine learning algorithms behind them. Unfortunately, even though data science is a concept invented in the 21st century, in practice its state is closer to that of software engineering in the mid-20th century.

The pioneers of data science did a great job of making it very accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches production environments, it should follow standard software engineering practices such as modularity, maintainability and quality (among many others).

The talk Scalable and Flexible Machine Learning, which I gave with Christopher Severs at multiple venues, reflected our frustration with this issue and proposed a solution that we feel brings us closer to where data science should be.

The first thing to understand is that data science is mostly manipulation of data. The data is often large and complex, but the same kinds of manipulations are commonly found in non-data-science code bases as well. Moreover, a data scientist usually doesn’t know in advance which functionality will achieve the desired result (for example, whether the right metric is the mean or the median), and being able to swap one piece for another makes the argument for modular code that much stronger.
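To make the mean-versus-median point concrete, here is a minimal Python sketch of what I mean by modular. The function `summarize` and the `session_lengths` data are hypothetical names invented for this illustration, not code from the talk. The metric is just a parameter, so trying a different one is a one-word change rather than a rewrite.

```python
from statistics import mean, median
from typing import Callable, Dict, Iterable, List, Tuple

# A metric is any function that collapses a list of numbers into one number.
Metric = Callable[[List[float]], float]


def summarize(records: Iterable[Tuple[str, float]], metric: Metric) -> Dict[str, float]:
    """Group values by key, then collapse each group with the given metric."""
    groups: Dict[str, List[float]] = {}
    for key, value in records:
        groups.setdefault(key, []).append(value)
    return {key: metric(values) for key, values in groups.items()}


# Hypothetical data: (platform, session length in minutes).
session_lengths = [
    ("mobile", 1.0),
    ("mobile", 2.0),
    ("mobile", 30.0),
    ("web", 4.0),
    ("web", 6.0),
]

# Swapping the metric does not touch the grouping logic at all.
by_mean = summarize(session_lengths, mean)
by_median = summarize(session_lengths, median)
```

With the one outlier in the mobile sessions, the two metrics tell very different stories, and finding that out costs nothing when the metric is a pluggable piece.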

Our second proposition was that the current tooling that data scientists use is inadequate for the type of systems that end up in production. Real-world data processing is in many cases at least as complicated as regular software tasks such as fetching data from a database, passing messages between mobile devices or throttling the bit rate of a streamed video. Yet the tools used for it sit at two extremes. At one end of the spectrum are the SQL-like Hive and the very basic scripting language Pig. Although these languages can be extended with user-defined functions, those functions are very ad hoc in nature and very difficult to reuse. At the other end is vanilla Java MapReduce, which requires an awful lot of boilerplate code that has very little to do with the actual desired functionality.

We tried to propose a more modern alternative that combines the best of both worlds: as concise and high-level as Pig and Hive, yet backed by a fully powered programming language like Java. Our technology of choice was Scalding, a high-level abstraction framework over MapReduce written in Scala. It offers all of the functionality of Pig and Hive, sometimes in fewer lines of code. The fact that it is written in Scala, a modern language that runs on the Java Virtual Machine, is a great feature, since all of the Java libraries can be reused. But Scala is not just a modern version of Java; it is a language designed with the functional paradigm in mind.

I won’t go into what a functional programming language is in this post, or into all the features that make it better for data processing. But if I had to provide just a single piece of evidence to support my case, it is that MapReduce is a very functional concept. Both map and reduce are higher-order functions, a cornerstone of functional programming. It is not a coincidence that Google chose this simple functional paradigm as the basis of their entire data processing framework.
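For readers who have not seen map and reduce used as plain higher-order functions, here is a minimal single-machine word count in Python. It is only a sketch of the idea: `map_phase`, `reduce_phase` and the sample `lines` are names and data I made up for illustration, and a real MapReduce system also shuffles the pairs by key and distributes the work across machines, which this deliberately leaves out.

```python
from functools import reduce
from typing import Dict, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(counts: Dict[str, int], pair: Tuple[str, int]) -> Dict[str, int]:
    """Fold one (word, n) pair into the running totals without mutating them."""
    word, n = pair
    updated = dict(counts)
    updated[word] = updated.get(word, 0) + n
    return updated


# Hypothetical input, standing in for lines read from a large file.
lines = ["to be or not to be", "to do is to be"]

# map takes a function as its argument, which is what makes it higher-order.
pairs = [pair for emitted in map(map_phase, lines) for pair in emitted]

# reduce also takes a function, plus an empty dict as the starting value.
word_counts = reduce(reduce_phase, pairs, {})
```

Neither `map` nor `reduce` knows anything about words. All the problem-specific logic lives in the two small functions passed to them, which is exactly the property that lets a framework run the same pattern over many machines.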

Our last point addressed the tendency of many data scientists to overcomplicate their algorithms. These data scientists begin their search for a solution by reading all the literature on the topic. One of the problems with this approach is that it tends to overcomplicate the solution for the sake of a minor gain. The best example of this phenomenon is the Netflix Challenge. Netflix offered one million dollars to the team that could beat its in-house algorithm by at least 10%. The winning algorithm was so complicated that Netflix decided to pay the prize without actually implementing it. In many cases, the first algorithm to try can be an off-the-shelf one found in almost any machine learning package, for example PageRank for ranking or Collaborative Filtering for recommendations.

To sum it all up, if you are writing your data processing code in complete disregard of the engineering principles of the last few decades, and developing your algorithms from state-of-the-art publications without trying the simple ones first – You’re Doing It Wrong!
