The Unreasonable Effectiveness of Data Scientists

My romance with data science began when someone recommended the book “Moneyball” by Michael Lewis. If you haven’t read it, please do. At the very least, watch the movie. Moneyball is the story of the transformation of the Oakland A’s from one of the worst teams in baseball into a team that set the American League record of 20 wins in a row. One of the main reasons for this transformation was their reliance on statistics instead of the gut feeling and domain expertise by which baseball had always been managed. To understand how amazing this transformation is, let’s look at the numbers. The average team payroll is about 90 million dollars a year, and an average team wins 81 of its 162 regular-season games. This means that each win costs a baseball team roughly a million dollars. I don’t know how much the A’s paid Paul DePodesta, their data scientist, but I’m sure he was well worth it.

Of course, most data scientists do not work for sports teams, but the stakes are even higher at internet companies, considering current valuations. Companies like Google, Facebook and my current employer, LinkedIn, have grown immensely and become Fortune 500 companies with 10 times fewer employees than other companies on that list. There are many reasons for these companies’ success, but their proficiency in handling data is surely one of them.

In 2009, researchers at Google published a paper titled “The Unreasonable Effectiveness of Data”. Though I agree with most of it, I would argue that the effectiveness comes first and foremost from the data scientists themselves. Most people see the data scientist as someone who takes a problem, gathers some data, applies a machine learning algorithm and gets results in the form of a chart. Let’s look at this process more closely.

Problems – More important than solving a data problem is finding the right data problem to solve. The right problem is one that can move the needle on a metric important to your business. Most of the time, however, the business problem is not itself a data problem. Great data scientists use data to find a chain of causes and effects that leads them to a data problem whose solution also solves the business problem. For example:

  1. Netflix wants to retain customers after their trial period – problem
  2. Customers who watch more movies are more likely to sign up – cause/effect #1
  3. Customers who discover good movies will watch more movies – cause/effect #2
  4. Build a system to recommend movies to customers – solution

At first glance it doesn’t seem that building a recommender system for movies helps Netflix retain more customers, but the chain of causes and effects shows why it does. Working this way not only helps solve the problem; it also provides valuable data and insights to the rest of the company and significantly lowers the business risk of the project.

Data – Without context, data is just bytes on your hard drive. In the age of big data, people tend to measure companies by the amount of data they store, but this is no more reasonable than measuring software by its number of lines. Not all data is created equal. A good data scientist knows the value of each data set in her possession; a great one also knows the value of the data sets she does not have, and how to get them. One of the best examples of this principle comes from the Google Image Search project. The people who worked on it realized that getting more labels on the images they already had would yield better results than improving their machine learning algorithms. In 2006, Google released a game in which two players received the same image and each tried to describe what they were seeing. If both players used the same word, they would get points, and Google would get an invaluable piece of information about the image. Conceiving the game did not require fancy PhD-level statistics, just a very clever sense of general problem solving.
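To make the idea concrete, here is a minimal Python sketch of how agreement between two players can turn guesses into labels. This is not Google's implementation: the function names, image IDs and guesses below are all hypothetical, and the only rule it encodes is the one described above, that a word counts only when both players typed it independently.

```python
from collections import Counter


def normalize(words):
    """Lowercase and trim guesses, dropping empty strings."""
    return {w.strip().lower() for w in words if w.strip()}


def agreed_labels(guesses_a, guesses_b):
    """Return the words that both players typed for the same image."""
    return normalize(guesses_a) & normalize(guesses_b)


def collect_labels(rounds):
    """Tally agreed labels per image across many rounds.

    `rounds` is an iterable of (image_id, guesses_a, guesses_b) tuples.
    """
    labels = {}
    for image_id, guesses_a, guesses_b in rounds:
        tally = labels.setdefault(image_id, Counter())
        tally.update(agreed_labels(guesses_a, guesses_b))
    return labels


# Hypothetical rounds, made up purely for illustration.
rounds = [
    ("img-1", ["Dog", "grass", "ball"], ["puppy", "dog", "park"]),
    ("img-1", ["dog", "park"], ["Park", "dog"]),
    ("img-2", ["car"], ["truck"]),
]
labels = collect_labels(rounds)
```

A word that several independent pairs agree on is more likely to be a trustworthy label than one that appears once, which is why the sketch keeps a count rather than a plain set.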

Results – This is what truly matters. Michael Lewis did not write the book about the excellent analysis Paul DePodesta performed on his computer to prove that the way scouts analyze players is wrong. In fact, Sabermetrics, the field of baseball analysis that Paul DePodesta based his analysis on, has existed since 1964, almost 40 years before Billy Beane, the A’s general manager, implemented it. The legend was born only after the analysis led to something that mattered: the team set a record for most consecutive wins and made the playoffs despite losing its three best players at the beginning of the season. The same goes for data scientists: the work does not end once the analysis has been completed. In fact, it has just begun. Great data scientists know the difference between theory and practice and follow their ideas through to completion.

In summary, data science is a great tool to have in a company’s toolbelt and can have a disproportionate impact on its achievements. Great data scientists understand the business needs of a company, use data to find the best solution and make this solution a reality.



4 thoughts on “The Unreasonable Effectiveness of Data Scientists”

  1. Maybe I missed something, but I don’t see the points about the effectiveness of data scientists anywhere here.
    What exactly did the data scientist do in the Baseball team?
    Was the Google tagging game designed by a data scientist?
    I disagree with saying “At first glance it doesn’t seem that building a recommender system for movies helps Netflix to retain more customers” and even if that is true, how is a simple problem and solution relevant specifically to data scientists?
    I think you should try to bring more concrete examples to prove the effectiveness of data scientists.

    Good first try though!

    • Hi Bah,
      An entire book and a movie were made about the use of statistics by the Oakland A’s. I didn’t think it would add value to reproduce that work in this post.
      The Google team that was working on the Image Labeler definitely understood the concepts of gathering training data, so they were doing data science even if it wasn’t their exact title.
      Netflix could have pursued many other projects to try to improve, but the fact that they were willing to give away a million dollars as a prize for the best recommender system proves how important great recommendations are to their business.

      Thanks for the kind words

      • Oh I think you misunderstood my point.
        I was not asking for clarifications on the examples you gave, I was asking for more focused data science applications and use cases.
        While the book and the movie you are referring to might explain the work done in the baseball team more thoroughly, you can’t assume your readers have already read the book or seen the movie; at the very least you should give a short summary of the relevant data science points. Otherwise the reader is left with very little actual information.

        And like I said, the writing looks polished and makes you want to read on; that’s why the feeling of “Wait, but I didn’t really learn anything meaningful about data science” is disappointing.

  2. I get the point. Data is critical and can provide some insights via simplistic machine learning techniques (say clustering), but without a Data Scientist putting it to use to solve specific problems you don’t get much. There are also good arguments about which is more valuable “better data” vs. “better algorithms”, which are compelling as well.
