The Unreasonable Effectiveness of Data Scientists

My romance with data science began when someone recommended the book “Moneyball” by Michael Lewis. If you haven’t read it, please do. At the very least, watch the movie. Moneyball is the story of how the Oakland A’s went from being one of the worst teams in baseball to a team that set the American League record of 20 wins in a row. One of the main reasons for this transformation was their reliance on statistics instead of the gut feeling and domain expertise by which baseball had always been managed. To understand how amazing this transformation is, let’s look at some baseball numbers. The average team payroll is about 90 million dollars a year, which in millions is roughly the number of games a team wins in a season. This means that each win costs a baseball team about a million dollars. I don’t know how much the A’s paid Paul DePodesta, their data scientist, but I’m sure he was well worth it.

Of course, most data scientists do not work for sports teams, but the stakes are even higher at internet companies, considering current valuations. Companies like Google, Facebook and my current employer, LinkedIn, have grown immensely and become Fortune 500 companies with 10 times fewer employees than other companies on that list. There are many reasons for these companies’ success, but their proficiency in handling data is surely one of them.

In 2009, Google released a paper titled “The Unreasonable Effectiveness of Data”, and though I agree with most of it, I would argue that the effectiveness comes first and foremost from the data scientists themselves. Most people see the data scientist as someone who takes a problem, gathers some data, applies a machine learning algorithm and gets results in the form of a chart. Let’s look at this process more closely.

Problems – More important than solving a data problem is finding the right data problem to solve. The right problem is one that can move the needle on a metric important to your business. Most of the time, however, the business problem is not itself a data problem. Great data scientists use data to find a chain of causes and effects that leads them from the business problem to a data problem whose solution also solves the business problem. For example:

  1. Netflix wants to retain customers after their trial period – problem
  2. Customers who watch more movies are more likely to sign up – cause/effect #1
  3. Customers who discover good movies will watch more movies – cause/effect #2
  4. Build a system to recommend movies to customers – solution

At first glance, it doesn’t seem that building a recommender system for movies would help Netflix retain more customers; the chain of causes and effects is what connects the two. Using this technique not only helps to solve the problem, it also provides valuable data and insights to the rest of the company and significantly lowers the business risk of the project.

Data – Without context, data is just bytes on your hard drive. In the age of big data, people tend to measure companies by the amount of data they store. This is no more reasonable than measuring software by the number of lines in it. Not all data is created equal. A good data scientist knows the value of each data set in her possession; a great one also knows the value of the data sets she does not have, and how to get them. One of the best examples of this principle can be found in the Google Image Search project. The people who worked on this project realized that getting more labels for the images they had would yield better results than improving their machine learning algorithms. In 2006, Google released a game in which two players were shown the same image and asked to describe what they were seeing. If both players used the same word, they would get points, and Google would get an invaluable piece of information about the image. Conceiving the game did not involve fancy PhD-level statistics, just a very clever sense of general problem solving.
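To make the mechanic of that game concrete, here is a minimal sketch of how agreement between two players could be turned into image labels. It is not Google's implementation: the function names (matched_labels, collect_labels), the image ids and the sample guesses are all made up for illustration, and it assumes a word counts as a match when both players typed it, ignoring case and surrounding spaces.

```python
from collections import Counter


def matched_labels(words_a, words_b):
    """Return the words both players typed for the same image.

    Assumption: matching ignores case and surrounding whitespace.
    The result keeps player A's order and drops duplicates.
    """
    normalized_b = {w.strip().lower() for w in words_b}
    seen = set()
    matches = []
    for word in words_a:
        key = word.strip().lower()
        if key in normalized_b and key not in seen:
            seen.add(key)
            matches.append(key)
    return matches


def collect_labels(rounds):
    """Count, per image, how many player pairs agreed on each word."""
    labels = {}
    for image_id, words_a, words_b in rounds:
        counts = labels.setdefault(image_id, Counter())
        counts.update(matched_labels(words_a, words_b))
    return labels


# Hypothetical rounds: (image id, player A's guesses, player B's guesses).
rounds = [
    ("img-1", ["dog", "grass", "ball"], ["puppy", "Dog", "park"]),
    ("img-1", ["dog", "park"], ["park", "dog"]),
    ("img-2", ["beach"], ["ocean"]),
]
labels = collect_labels(rounds)
```

A word that several independent pairs agree on is more likely to be a trustworthy label than one typed by a single player, which is why the sketch counts agreements instead of just recording them.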

Results – This is what truly matters. Michael Lewis did not write the book about the excellent analysis Paul DePodesta performed on his computer to prove that the way scouts analyze players is wrong. In fact, Sabermetrics, the field of baseball analysis that Paul DePodesta based his analysis on, has existed since 1964, almost 40 years before Billy Beane, the A’s general manager, implemented it. The legend was born only after the analysis led to something that mattered: a record for most consecutive wins and a place in the playoffs, despite the team losing its three best players at the beginning of the season. The same goes for data scientists. The work does not end once the analysis has been completed. In fact, it just begins. Great data scientists know the difference between theory and practice and will see their ideas through to completion.

In summary, data science is a great tool to have in a company’s toolbelt and can have a disproportionate impact on its achievements. Great data scientists understand the business needs of a company, use data to find the best solution and make this solution a reality.
