The NSA and Big Data: The Power and Peril of Metadata

Bloomberg Businessweek has a couple of great pieces out that explain what metadata is and why the NSA is vacuuming it all up. The bottom line: your metadata may say just as much about you as the actual content of your calls and emails, if not more.

First, what is metadata generally? Paul Ford of Bloomberg Businessweek, quoting Director of National Intelligence James Clapper in his piece “Balancing Security and Liberty in the Age of Big Data,” draws a useful comparison between the complete written contents of a book (the data) and the Dewey Decimal numbers on the back that help you classify that book (the metadata). On the NSA phone scandal, however, Ford parts ways with Clapper on this analogy. Whereas Clapper compared phone metadata to Dewey Decimal numbers, Ford believes that a more accurate way to conceptualize the phone metadata being taken by the NSA is as the index at the back of a book. He justifies that take with the following:

What the NSA seems to be doing is treating hundreds of millions of people like open books and indexing them: Who are they, who do they know, where have they been, and so forth.

In other words, the NSA is building “big organized indexes of human beings,” categorizing the numbers of both caller and called, the locations of both, the time, etc.
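To make the "index" idea concrete, here is a minimal sketch of how a pile of call records could be turned into a per-person lookup of "who do they know" and "where have they been." The record fields, phone numbers, and the `build_index` helper are all hypothetical, invented for illustration; nothing here reflects the NSA's actual data format or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One call's metadata. Field names are assumptions; no call content is stored."""
    caller: str
    callee: str
    caller_location: str
    callee_location: str
    start_time: str


def build_index(records):
    """Index records by phone number: who each number talked to, and where it was seen."""
    contacts = defaultdict(set)
    places = defaultdict(set)
    for r in records:
        contacts[r.caller].add(r.callee)
        contacts[r.callee].add(r.caller)
        places[r.caller].add(r.caller_location)
        places[r.callee].add(r.callee_location)
    return dict(contacts), dict(places)


# Made-up records for illustration only.
records = [
    CallRecord("555-0100", "555-0101", "Boston", "Lexington", "2013-06-01T09:00"),
    CallRecord("555-0100", "555-0102", "Boston", "Concord", "2013-06-01T21:30"),
    CallRecord("555-0102", "555-0100", "Concord", "Cambridge", "2013-06-02T07:15"),
]

contacts, places = build_index(records)
print(sorted(contacts["555-0100"]))  # everyone this number has been in contact with
print(sorted(places["555-0100"]))    # everywhere this number has been seen
```

Even in this toy version, looking up a single number returns that person's contacts and whereabouts without anyone having listened to a call.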

Why does this matter, and why should you care? As long as nobody is actually listening in on your calls, no harm no foul, right? Well, not so fast.

Big Data is all about using new analytics tools to comb massive quantities of data points for hidden insights. That process starts by assembling data that’s “well-defined and cleanly organized.” Once that’s assembled:

you can employ beautiful, supple pieces of software—some with point-and-click interfaces and little icons—to help you understand what you’re seeing. It’s powerful stuff.

And you don’t necessarily need such software to draw out insights from that data, as the following story that leads Paul Ford’s piece demonstrates:

A very large Internet company once had the noble impulse to share some of its data with the research community. It made three months of log files from its search service available to all. The company took many steps to preserve privacy, removing personal information and randomizing ID numbers in the belief that this would make it impossible to identify any of the more than 650,000 customers who’d used the service. But Internet hobbyists, professional researchers, and journalists were able to ferret out many of the users. No. 4417749, for example, was a Georgia widow. Another user appeared to be planning a murder. Today, the AOL (AOL) Search Log Scandal is remembered as one of the weirdest missteps in Internet history.

In other words, metadata can be quite revealing, even without sophisticated data mining technology.

It will therefore be quite easy for the NSA to trawl the hundreds of millions of phone, email and other records it has collected with powerful analytics software and find patterns and links. What it finds will depend on what that software is asked to search for.

If that software is asked to search for patterns linked to terrorist activity, then great. So far, we’ve been told, that is exactly what it has been used for, reportedly to help stop a significant number of attacks.

If, however, that software is asked to find patterns and networks of political dissent, then not so great. In fact, it would be very dangerous for the dissenters, as another Bloomberg Businessweek article, “What If the ‘Redcoat NSA’ Had Access to Paul Revere’s Metadata” by Joshua Brustein, demonstrates. Brustein cites a recent analysis by Kieran Healy, an associate professor of sociology at Duke:

Through a series of relatively simple steps, he determined how people were connected to one another, how the memberships of various groups overlapped, and which people served as the most important connections within the revolutionary community. He quickly identified Paul Revere as the kind of person who might be best-positioned to warn his co-conspirators when the redcoats were coming.

In Kieran Healy’s own words:

For the simple methods I have described are quite generalizable in these ways, and their capability only becomes more apparent as the size and scope of the information they are given increases. We would not need to know what was being whispered between individuals, only that they were connected in various ways. The analytical engine would do the rest!
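The kind of "relatively simple steps" described above can be sketched in a few lines. This is an illustrative approximation, not Healy's actual code or data: the people, the groups, and the helper names `co_membership` and `connection_counts` are all made up, and counting direct links is only one simple stand-in for "most important connection."

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, group) membership records, invented for illustration.
MEMBERSHIPS = [
    ("alice", "club_a"),
    ("alice", "club_b"),
    ("alice", "club_c"),
    ("bob", "club_a"),
    ("carol", "club_b"),
    ("dave", "club_c"),
    ("erin", "club_c"),
]


def co_membership(memberships):
    """Map each pair of people to the number of groups they share."""
    members = defaultdict(set)
    for person, group in memberships:
        members[group].add(person)
    shared = defaultdict(int)
    for people in members.values():
        for p, q in combinations(sorted(people), 2):
            shared[frozenset((p, q))] += 1
    return dict(shared)


def connection_counts(shared):
    """Count how many distinct people each person is linked to through any group."""
    counts = defaultdict(int)
    for pair in shared:
        for person in pair:
            counts[person] += 1
    return dict(counts)


shared = co_membership(MEMBERSHIPS)
counts = connection_counts(shared)
best_connected = max(counts, key=counts.get)
print(best_connected, counts[best_connected])
```

Note that the input contains no conversations at all, only who belonged to what. The person who bridges otherwise separate groups still falls straight out of the counts.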

Joshua Brustein also spoke with Matt Blaze, an encryption expert who directs the Distributed Systems Lab at the University of Pennsylvania. Blaze said that “on a massive scale, when you look at everyone’s metadata, it becomes even more powerful, more revealing, perhaps, than content.” Why? Because “Unlike with content, there’s no real limitation on how much of it can be effectively processed… And the more metadata there is, the more revealing it is.”

Brustein ends his piece jokingly speculating that we could “be saluting a different flag” if the Brits had been able to mine metadata, but the capability is definitely no joke. I’ll steer clear of legal/constitutional issues here, as I’m no constitutional lawyer. But to me, how comfortable you should feel about these NSA metadata collection and mining programs ought to depend on how much you trust each and every user to use these powerful capabilities and insights properly and respectfully.

Call me uncomfortable.

For Paul Ford’s “Balancing Security and Liberty in the Age of Big Data,” follow this link.

For Joshua Brustein’s “What If the ‘Redcoat NSA’ Had Access to Paul Revere’s Metadata,” click here.