When Data Models and Rights Collide

My first position as a data scientist was in Israeli intelligence, in a unit that is the equivalent of the American National Security Agency (NSA). I do not know anything about PRISM, the NSA's surveillance program, but my experience, both in government positions and afterwards working for two big data companies, helps me understand what drives the people who work on it.

The first thing to understand about this problem is that it is not a big data problem. It is a huge data problem. Words cannot describe the amounts of data we are talking about. Storing all the communication information from Google, Facebook, and the wireless carriers would require multiple data centers, each the size of several football fields. Such amounts of data cannot be processed by people; instead, intelligence organizations rely on sophisticated algorithms to do most of the work of finding the needle in the haystack they are looking for.

Intelligence agencies are trying to solve a lot of different challenges. For this post I will focus on a single one: which individuals present a threat to national security. In the machine learning world, this is referred to as a classification problem. Classification is the problem of identifying to which category an observation belongs. For example: given the level of force applied to a car's door handle, determine whether the alarm should go off; given an image, determine whether it contains a human face; or given a person's communication records, determine whether she poses a threat.
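To make this concrete, here is a minimal sketch of what a classifier looks like in code, using Python and scikit-learn. Every feature, number, and label below is invented purely for illustration; a real system would look nothing like this.

```python
# Toy binary classification: given numeric features describing an
# observation, predict which of two categories it belongs to.
from sklearn.linear_model import LogisticRegression

# Hypothetical features, e.g. [messages_per_day, distinct_contacts]
X_train = [
    [5, 3],
    [200, 40],
    [12, 8],
    [150, 35],
]
# Hypothetical labels: 0 = no threat, 1 = threat (entirely made up)
y_train = [0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

# Classify a new, unseen observation
print(model.predict([[100, 20]]))        # predicted category
print(model.predict_proba([[100, 20]]))  # confidence per category
```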

The way to build a good classification model is to have a good training set. A training set is a set of records for which you already know the answer. In the face recognition challenge, the training set will have some images that contain faces and some that do not. It is very easy to come up with an extensive training set for facial recognition. Not so much for terrorists. There are just not that many cases of real threats to national security to learn from, and those that do exist differ considerably from each other. A very small training set leads to a very inaccurate classification model.
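The effect of training-set size is easy to demonstrate on synthetic data. In this sketch (the data is randomly generated, so the exact numbers are illustrative only), the same model is trained on progressively larger slices of the data and evaluated on a held-out set; accuracy climbs as the training set grows.

```python
# Rough sketch: the same classifier trained on tiny vs. larger
# samples of synthetic data, compared on held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for n in (10, 100, 2000):  # training-set sizes to compare
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    print(f"trained on {n:>4} records -> accuracy {model.score(X_test, y_test):.3f}")
```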

In the case of national security, this is simply not good enough. Every threat the model fails to detect might result in a very bad outcome. After 9/11, no one wants to be the person who let a terrorist attack happen on their watch. This concern drives data scientists to constantly try to improve their models. Unfortunately, it is not always possible to improve a model using the same training data. So when data scientists exhaust their modeling capabilities, they turn to getting more data. But in the case of identifying people who don't want to be identified, the most useful data is not the kind you would find in public records.

At a certain point, the fear of being responsible for people's lives numbs the sense of right and wrong. People start to see only the model and forget about its consequences. When every percentage point of improvement means another terrorist detected before committing something terrible, privacy takes a backseat.
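There is also a cold statistical reason why this trade-off hits privacy so hard: real threats are extremely rare, so even an excellent classifier flags far more innocent people than actual threats. A back-of-the-envelope calculation (every number below is invented) makes the point:

```python
# Base-rate arithmetic (all numbers hypothetical): when true positives
# are extremely rare, even a tiny false-positive rate means the flagged
# group is overwhelmingly made up of innocent people.
population = 300_000_000     # hypothetical monitored population
true_threats = 1_000         # hypothetical number of real threats
recall = 0.99                # fraction of real threats the model catches
false_positive_rate = 0.001  # fraction of innocents wrongly flagged

true_positives = recall * true_threats
false_positives = false_positive_rate * (population - true_threats)

print(f"real threats caught:     {true_positives:,.0f}")   # ~990
print(f"innocent people flagged: {false_positives:,.0f}")  # ~300,000
```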

But privacy is not necessarily less important than safety. In some sense, privacy is safety, and "National Security" is not a magic term that allows the government to trample over basic rights. In a democratic society, the government is required to do better than paternalize its people. Drastic times call for drastic measures, but these are not drastic times; so drastic measures call for drastic explanations.

Disclaimer: This post does not represent the official opinion of anyone but myself. I have no knowledge whatsoever of PRISM or equivalent programs. This post is just my educated guess about what drives the people who work on such programs.

You should follow me on Twitter.
