The Problem With Too Much Data: Mistaking The Signal For The Noise

from the quantity-over-quality dept

The NSA can’t get enough data, as is evidenced by its shiny, new data center and its multiple efforts to either bypass laws entirely or have them rewritten in its favor. General Alexander, in particular, wants all the data. Everything. And as Mike covered earlier, he’s not shy about grabbing the data first and worrying about the legality later.

In his enthusiastic pursuit of more data, Alexander seems to have skipped any confirmation that adding more data actually helps. Here’s one issue the indiscriminate data harvesting raised.

“He had all these diagrams showing how this guy was connected to that guy and to that guy,” says a former NSA official who heard Alexander give briefings on the floor of the Information Dominance Center. “Some of my colleagues and I were skeptical. Later, we had a chance to review the information. It turns out that all [that] those guys were connected to were pizza shops.”

Tons of noise, or rather, tons of dots, the kind intelligence leaders seem to believe we’re still short on. Alexander certainly liked connecting dots, but seemed unconcerned if the resulting picture was completely unintelligible.

Under Alexander’s leadership, one of the agency’s signature analysis tools was a digital graph that showed how hundreds, sometimes thousands, of people, places, and events were connected to each other. They were displayed as a tangle of dots and lines. Critics called it the BAG — for “big ass graph” — and said it produced very few useful leads.

When you have tons of data, you have to filter out the noise if you’re going to use it in any meaningful way. Alexander may have learned from the previous experience that while many terrorists may purchase pizzas, not everyone who purchases pizza is a terrorist. Hence the first level of “auditing,” as Marcy Wheeler points out at emptywheel.

As I noted last month, the NSA’s primary order for the Section 215 program allows for technical personnel to access the data, in unaudited form, before the analysts get to it. They do so to identify “high volume identifiers” (and other “unwanted BR metadata”). As I said, I suspect they’re stripping the dataset of numbers that would otherwise distort contact chaining.

I suspect a lot of what these technical personnel are doing is stripping numbers — probably things like telemarketer numbers — that would otherwise distort the contact chaining… I used telemarketers, but Alexander himself has used the example of the pizza joint in testimony.

In other words, it appears Alexander learned from his mistake at INSCOM that pizza joints do not actually represent a meaningful connection. His use of the example seems to suggest that NSA now strips pizza joints from their dataset.
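To make the mechanics concrete, here is a minimal sketch of what stripping “high volume identifiers” before contact chaining could look like. Everything in it is an assumption for illustration: the call records, the names, the `max_contacts` threshold, and the helper functions are hypothetical, and nothing here is drawn from how the NSA actually processes the data.

```python
from collections import defaultdict, deque

def build_graph(calls):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = defaultdict(set)
    for a, b in calls:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def strip_high_volume(graph, max_contacts):
    """Return a copy of the graph without identifiers that have more
    than max_contacts distinct contacts (the threshold is hypothetical)."""
    hubs = {n for n, nbrs in graph.items() if len(nbrs) > max_contacts}
    return {n: nbrs - hubs for n, nbrs in graph.items() if n not in hubs}

def contact_chain(graph, seed, hops):
    """Return every identifier reachable from seed within `hops` hops."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand past the hop limit
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen - {seed}

# Made-up call records: three strangers and one suspect all called the same shop.
calls = [
    ("suspect", "pizza_shop"),
    ("alice", "pizza_shop"),
    ("bob", "pizza_shop"),
    ("carol", "pizza_shop"),
    ("suspect", "dave"),
]
graph = build_graph(calls)

raw = contact_chain(graph, "suspect", hops=2)
filtered = contact_chain(strip_high_volume(graph, max_contacts=3), "suspect", hops=2)

print(sorted(raw))       # chain with the pizza shop left in
print(sorted(filtered))  # chain after the high-volume number is stripped
```

With the shop left in, every customer lands within two hops of the suspect. With it stripped, only the direct contact remains, and so does nothing that ran solely through the shop.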

Separating the signal from the noise is the first step for working with any large data set. But the NSA’s separation step operates under the assumption that every number with an inordinate number of hits is just noise. If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists. Wheeler goes back through the series of missed connections by intelligence and law enforcement agencies that were uncovered after the Boston bombing.

I also suspect there may be one gaping hole in the NSA’s data relating to the Tsarnaevs: any calls and connections through Gerry’s Italian Kitchen.

Gerry’s was, if you recall, the pizza joint involved in the 2011 murder in Waltham: the three men were killed sometime between ordering a pizza and its delivery 45 minutes later. I’ve been told both Tsarnaevs had delivered pizza for that restaurant before then and Tamerlan may still have been.

But Gerry’s is also where the brothers disposed of some of their explosives the night of the manhunt, and it may well have been what brought them to Watertown.

So a connection to the brothers going back years when they worked there, a connection to the 2011 murder, and a connection (however tangential) to the manhunt. Yet (I’m guessing here) any ties the brothers had through that pizza joint would not show up in the dragnet collected precisely for that purpose, because such data is purged because normally pizza joints don’t reflect a meaningful relationship.

Here’s where the NSA’s collection activities become a damned-if-you-do, damned-if-you-don’t situation. Leave the pizza places in and everyone is linked to terrorists. Take them out and you delete helpful connections. The agency will probably point to the need to access more data, in order to somehow further filter the previously collected data. It has most likely already devoted several million dollars towards solving this conundrum — more analysts, more tools, more data. The one thing it hasn’t considered, apparently, is the simplest solution: targeted collections.

[B]ecause this was a dragnet, rather than a collection of the brothers’ calls, this pizza connection may have been hidden entirely in the data.

The continuous, ever-increasing flow of data into the NSA’s haystacks has just as much of a chance to bury useful connections as it has to bring them to light. Intelligence agencies don’t care much for targeted data acquisition, preferring to pick it up in bulk “just in case.”

It’s as though the collection of data is its own end. I suppose the only “fortunate” aspect of this dragnet is that it’s occurring in a digital age, thus keeping the NSA’s data centers from looking like interior shots of a particularly horrific episode of “Hoarders.” The theory is that this will prevent terrorist attacks. But in practice, it keeps looking as if our intelligence agencies could be just as ineffective with half the data.


Comments on “The Problem With Too Much Data: Mistaking The Signal For The Noise”

29 Comments
Anonymous Coward says:

Re: Re:

Clearly you are lacking in imagination. The obvious move is to mark businesses with US addresses as potential dens of terrorist infiltrators. Then you also mark all businesses with non-US addresses as terrorist dens. It’s the only way to be sure you are collecting all of the data on possible terrorists.

Also, since we are spying on people with connections to terrorists, we can’t ignore the phone companies as sources of linkages. Terrorists use AT&T, AT&T has a business relationship with Verizon, therefore all Verizon customers are linked to terrorists within 3 links or fewer.

I feel dirty just for writing that, I wish it wasn’t so representative of how the government appears to be thinking.

Mega1987 (profile) says:

graphing the relationship...

Looks like NSA is attempting to make the Biggest Relationship diagram ever made in human history…

With over 6 BILLION people to cross-reference and relate to each other…

well… you’d be better off analysing a relationship diagram of Ranma 1/2 or Negima! than what the NSA is planning to make…

But one thing is for sure… there are a lot of people who will have an “Annoyed at” relationship with anyone under the NSA right now over this mass relationship diagram making…

They should have tried graphing their own ancestry instead of this…

Anonymous Coward says:

Six Degrees of Separation

The whole point of the small network model was to point out that it is easy to connect any two people through a small number of person-to-person relationships. However, unless a direct contact can be shown between two people, even having a contact in common does not mean they have any contact with each other, or any common goals.
Keith Alexander does not appear to appreciate this, and this makes his approach to using data very, very dangerous.

Kal Zekdor (profile) says:

Not sure exactly what method the NSA is using, but you can’t just look at people who have a shallow connection to a known criminal and expect to get any meaningful results.

Even direct in person connections are usually completely innocuous, e.g. a neighbor, college roommate, brother or sister, etc.

At best you may find a “potential” criminal by cross-checking connectivity maps of two or more known criminals (especially those who aren’t directly connected to each other). Someone who is closely connected (within 2 jumps, not 3) to multiple known criminals would be rather suspect, and may warrant further (non-intrusive) investigation. I say non-intrusive because we do not (ostensibly) believe in guilt by association. Innocent until proven guilty, and all that good stuff.
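A rough sketch of that cross-checking idea, for illustration only: take the two-hop neighborhood of each known seed and keep whoever shows up near more than one of them. The graph, the names, and the `shared_contacts` helper are all hypothetical, and a hit here is a reason to look closer, not evidence of anything.

```python
from collections import defaultdict

def neighborhood(graph, seed, hops):
    """Identifiers within `hops` hops of seed, excluding the seed itself."""
    seen = {seed}
    frontier = {seed}
    for _ in range(hops):
        # Expand one hop, keeping only identifiers not visited yet.
        frontier = {nbr for node in frontier for nbr in graph.get(node, ())} - seen
        seen |= frontier
    return seen - {seed}

def shared_contacts(graph, seeds, hops=2, min_seeds=2):
    """Map each non-seed identifier to the seeds it sits near,
    keeping only those near at least min_seeds different seeds."""
    near = defaultdict(set)
    for seed in seeds:
        for person in neighborhood(graph, seed, hops):
            if person not in seeds:
                near[person].add(seed)
    return {p: s for p, s in near.items() if len(s) >= min_seeds}

# Made-up contacts: the two seeds are not directly connected to each other.
edges = [
    ("seed_a", "x"), ("x", "y"), ("seed_b", "y"),
    ("seed_a", "neighbor"),   # innocuous one-off contact of seed_a
    ("seed_b", "roommate"),   # innocuous one-off contact of seed_b
]
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

flagged = shared_contacts(graph, seeds={"seed_a", "seed_b"})
print(sorted(flagged))
```

The neighbor and the roommate each sit next to only one seed, so they drop out. Only the identifiers close to both seeds are kept.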

Then again, I’m not sure I should be giving the NSA tips on how to use the mountains of data they’ve illegally (or, at least, unethically) obtained. I’ve always had a fascination with large data sets, though. In another reality, it’s possible I could have been working for the NSA on just that sort of thing (and hopefully have had the courage to pull a Snowden).

New Mexico Mark says:

Why we'll never have real artificial intelligence

Obviously, a “find bad stuff” button is badly needed. Unfortunately, if you started inventing systems with enough intelligence to actually do this, the “bad stuff” flagged would be organizations like the NSA and politicians who create hostile/wasteful laws.

Hmmm…. Maybe that is a key reason why AI research is stalled right now. “What were you thinking? We didn’t mean REAL intelligence! Your research funding has been cut until we get more positive (to us) results.”

New Mexico Mark says:

Re: Why we'll never have real artificial intelligence

Actually, real analysts live in two worlds.

The first world is the complex, scary world of big data where every “conclusion” is actually just the first piece of a larger puzzle and must be tested several different ways before being given any credibility. Almost everything is just shades of gray.

The second world is that of producing pictures and useless “find bad stuff instantly” tools for executives and tourists who can’t be troubled to think and won’t accept that this is an impossible goal. The best outcome a good analyst can hope for is that no one treats the pictures or buttons as actionable information.

Uriel-238 (profile) says:

Fixed

Why we’ll never have magical artificial intelligence.

In strategy games, playing against computer AI, I noticed this thing that if I built a closed castle, the opposing armies would bring siege engines to breach the gates or the walls (whichever was weakest). Yet, if I left the gates open and turned the courtyard into a killzone, they’d happily rush their armies in to get mulched.

It turns out that humans (real intelligence) often make this mistake as well.

Similarly, we’ll never have artificial intelligence that can discern terrorist activity from benign communication because real intelligence cannot agree which is which, much in the way judges cannot discern when erotic artistic media ends and porn begins.

The Daily Telegraph crossword compiler was investigated by the government in 1944 for (coincidentally) including too many code words from Operation Overlord in his puzzles.

Anonymous Coward says:

The spying isn’t about catching terrorists. The spying is about attempting to control the world through political blackmail, corporate espionage, and oppressing dissident movements.

Refer to the spying on Brazil’s president, Brazil’s largest oil corporation, Bradley Manning, and Edward Snowden’s exile in Russia as proof of all three.

There are many more examples, of course. Edward Snowden understood what the oppressive global spying apparatus is really about.

He did his best to steer humanity away from its corrupt iron grip. We should attempt to do the same. Otherwise freedom will be lost, possibly forever.

Ajaxn says:

Time and Timing

Here’s a thought.

Everyone in this thread could be seen as connected, even though we’ve never met, or spoken to each other. We now share this meta connection.

Ditto everyone who has ever clicked a url, say as an entry in the mother of all meta nodes – google search, to read an article printed by this web site.

Then there would be our online ‘Trolls’ to consider. Trolls could be seen as ‘meta nodes’. Anyone who has ever encountered these online trolls will know they often have very wide agendas. Which means as nodes, anyone they target could be treated as if they were connected, just by virtue of who they have in common.

Meantime these trolls as meta nodes and functionaries of this system, would remain invisible as the cause or context for those connections.

Connections alone won’t tell the whole story; you would have to look at the frequency of those connections, as well as their context, in order to judge the relevance of that information.

As is often the case, we define x by what we seek. That limited set of attributes defining x could mean we fail to see other aspects of the information which might contradict our conclusions. In other words, our answers are only as good as the questions asked, which are only as good as the attributes of information recorded. A lot of data doesn’t mean a lot of useful data. Or, put another way, sometimes what you want in that data is information that allows you to exclude a particular result.
