Data Science – You’re Doing It Wrong!

A lot of articles have been written about data science being the greatest thing since sliced bread (including this one by me). Data products are the driving force behind new multi-billion-dollar companies, and many of the things we do on a day-to-day basis have machine learning algorithms behind them. Unfortunately, even though data science is a concept invented in the 21st century, in practice its state resembles that of software engineering in the mid-20th century.

The pioneers of data science did a great job of making it very accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches a production environment, it should follow standard software engineering practices such as modularity, maintainability, and quality (among many others).

The talk Scalable and Flexible Machine Learning, which I gave with Christopher Severs at multiple venues, reflected our frustration with this issue and proposed a solution that we feel brings us closer to where data science should be.

The first thing to understand is that data science is mostly manipulation of data. The data may be large and complex, but the manipulations themselves are commonly found in non-data-science codebases as well. Moreover, since a data scientist rarely knows in advance which piece of functionality will achieve the desired result, for example whether the right metric is the mean or the median, the argument for modular code becomes even stronger.
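
To make that concrete, here is a minimal sketch in Scala (all names are made up for illustration) of what the modularity looks like: the metric is an ordinary function passed as a parameter, so swapping the mean for the median is a one-line change rather than a rewrite.

```scala
// A metric is just a function from a collection of values to a score.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def median(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.size
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

// Hypothetical pipeline step: it takes the metric as a parameter,
// so experimenting with a different metric touches a single call site.
def scorePerUser(values: Map[String, Seq[Double]],
                 metric: Seq[Double] => Double): Map[String, Double] =
  values.map { case (user, xs) => user -> metric(xs) }

// scorePerUser(data, mean) versus scorePerUser(data, median)
```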

Our second proposition was that the current tooling data scientists use is inadequate for the kind of systems that end up in production. Real-world data processing is in many cases at least as complicated as regular software tasks such as fetching data from a database, passing messages between mobile devices, or throttling the bit rate of a streamed video, yet the tools at one end of the spectrum are the SQL-like Hive and the very basic scripting language Pig. Although these languages can be extended with user-defined functions, those functions are ad-hoc in nature and very difficult to reuse. At the other end of the spectrum there is vanilla Java MapReduce, which requires an awful amount of boilerplate code that has very little to do with the actual desired functionality.

We proposed a more modern alternative that combines the best of both worlds: as concise and high-level as Pig and Hive, yet a fully powered programming language like Java. Our technology of choice was Scalding, a high-level abstraction framework over MapReduce written in Scala. It has all of the functionality of Pig and Hive, sometimes achieved in fewer lines of code. Because it is written in Scala, a modern language on the Java Virtual Machine, all existing Java libraries can be reused. But Scala is not just a modern version of Java; it is a language designed for the functional paradigm.
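
To show how concise it is, here is the word-count job in Scalding's fields-based API, adapted from the project's own canonical example; the equivalent vanilla Java MapReduce program runs to dozens of lines of boilerplate.

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a line into lowercase alphanumeric tokens.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
```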

I won’t go into what a functional programming language is, or all the features that make it better for data processing, in this post. But if I had to offer a single piece of evidence to support my case, it is that MapReduce itself is a thoroughly functional concept. Both map and reduce are higher-order functions, a cornerstone of functional programming. It is no coincidence that Google chose this simple functional paradigm as the base of their entire data processing framework.
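
To see how literal the connection is, here is word count expressed with nothing but higher-order functions on plain Scala collections; the distributed job above has exactly the same shape, only over data that doesn't fit in memory.

```scala
val words = List("to", "be", "or", "not", "to", "be")

val counts = words
  .map(w => (w, 1))              // map phase: emit (word, 1) pairs
  .groupBy(_._1)                 // shuffle: group pairs by key
  .map { case (w, pairs) =>      // reduce phase: sum the counts per key
    w -> pairs.map(_._2).reduce(_ + _)
  }
// counts == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)
```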

Our last point addressed the tendency of many data scientists to overcomplicate their algorithms. These data scientists begin their search for a solution by reading all the literature on the topic of their problem. One problem with this approach is that it tends to overcomplicate the solution for the sake of a minor gain. The best example of this phenomenon is the Netflix Prize: Netflix offered one million dollars to any team that could beat its in-house algorithm by at least 10%. The winning algorithm was so complicated that Netflix decided to pay the prize without actually implementing it. In many cases, the first algorithm can be an off-the-shelf one found in almost any machine learning package, for example PageRank for ranking or collaborative filtering for recommendations.
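
As an illustration of how little code a sensible first attempt takes, here is a sketch of an off-the-shelf collaborative filtering model using Spark MLlib's ALS (this is not from our talk, and the input format, parameters, and user id are made up for the example; most mature machine learning packages offer something similar).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object FirstPassRecommender {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "FirstPassRecommender")

    // Hypothetical input: one "userId,itemId,rating" triple per line.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Off-the-shelf matrix factorization; rank, iterations, and lambda
    // here are illustrative defaults, not tuned values.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top five recommendations for an arbitrary user id.
    model.recommendProducts(42, 5).foreach(println)
    sc.stop()
  }
}
```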

To sum it all up: if you are writing your data processing code in complete disregard of the engineering principles of the last few decades, and developing your algorithms according to state-of-the-art publications without trying the simple ones first – You’re Doing It Wrong!


The Unreasonable Effectiveness of Data Scientists

My romance with data science began when someone recommended the book “Moneyball” by Michael Lewis. If you haven’t read it, please do. At the very least, watch the movie. Moneyball is the story of the transformation of the Oakland A’s from one of the worst teams in baseball into a team that set the American League record of 20 wins in a row. One of the main reasons for this transformation was their reliance on statistics instead of the gut feeling and traditional domain expertise with which baseball had always been managed. To understand how remarkable this transformation is, look at the numbers: the average payroll of a team is about 90 million dollars a year, which is roughly the number of wins in an average season, so each win costs a baseball team about a million dollars. I don’t know how much the A’s paid Paul DePodesta, their data scientist, but I’m sure he was well worth it.

Of course, most data scientists do not work for sports teams, but considering current valuations, the stakes are even higher at internet companies. Companies like Google, Facebook and my current employer, LinkedIn, have grown immensely and become Fortune 500 companies with ten times fewer employees than other companies on that list. There are many reasons for these companies’ success, but their proficiency in handling data is surely one of them.

In 2009, Google released a paper titled The Unreasonable Effectiveness of Data, and though I agree with most of it, I would argue that the effectiveness comes first and foremost from the data scientists themselves. Most people see the data scientist as someone who takes a problem, gathers some data, applies a machine learning algorithm, and gets results in the form of a chart. Let’s look at this process more closely.

Problems – More important than solving a data problem is finding the right data problem to solve. The right problem is one that can move the needle on a metric important to your business, and most of the time that is not, on its face, a data problem. Great data scientists use data to find a chain of causes and effects leading to a data problem whose solution also solves the business problem. For example:

  1. Netflix wants to retain customers after their trial period – problem
  2. Customers who watch more movies are more likely to sign up – cause/effect #1
  3. Customers who discover good movies will watch more movies – cause/effect #2
  4. Build a system to recommend movies to customers – solution

At first glance, building a movie recommender system doesn’t seem to help Netflix retain more customers, but the chain of causes and effects connects the two. Working this way not only solves the problem, it also provides valuable data and insights to the rest of the company and significantly lowers the business risk of the project.

Data – Without context, data is just bytes on your hard drive. In the age of big data, people tend to measure companies by the amount of data they store, which is no more reasonable than measuring software by its number of lines. Not all data is created equal. A good data scientist knows the value of each data set in her possession; a great one also knows the value of the data sets she doesn’t have, and how to get them. One of the best examples of this principle comes from Google Image Search. The people who worked on it realized that getting more labels on the images they already had would yield better results than improving their machine learning algorithms. So in 2006, Google released a game in which two players receive the same image and try to describe it; if both players use the same word, they earn points, and Google gets an invaluable piece of information about the image. The conception of the game did not involve fancy PhD-level statistics, just a very clever sense of general problem solving.

Results – This is what truly matters. Michael Lewis did not write the book about the excellent analysis Paul DePodesta performed on his computer to prove that the way scouts evaluate players is wrong. In fact, sabermetrics, the field of baseball analysis DePodesta’s work was based on, has existed since 1964, almost 40 years before Billy Beane, the A’s general manager, put it into practice. The legend was born only after the analysis led to something that mattered: a record for most consecutive wins and a playoff berth, despite the team losing its three best players at the beginning of the season. The same goes for data scientists: the work does not end once the analysis is complete; in fact, that is when it begins. Great data scientists know the difference between theory and practice and follow their ideas through to completion.

In summary, data science is a great tool to have in a company’s toolbelt and can have a disproportionate impact on its achievements. Great data scientists understand the business needs of a company, use data to find the best solution and make this solution a reality.
