The Problem With Too Much Data: Mistaking The Signal For The Noise
from the quantity-over-quality dept
The NSA can't get enough data, as is evidenced by its shiny, new data center and its multiple efforts to either bypass laws entirely or have them rewritten in its favor. General Alexander, in particular, wants all the data. Everything. And as Mike covered earlier, he's not shy about grabbing the data first and worrying about the legality later.
In his enthusiastic pursuit of more data, Alexander seems to have skipped any sort of confirmation that adding more data actually helps. Here's one issue the indiscriminate data harvesting raised.
“He had all these diagrams showing how this guy was connected to that guy and to that guy,” says a former NSA official who heard Alexander give briefings on the floor of the Information Dominance Center. “Some of my colleagues and I were skeptical. Later, we had a chance to review the information. It turns out that all [that] those guys were connected to were pizza shops.”
Tons of noise, or rather, tons of dots, the kind intelligence leaders seem to believe we're still short on. Alexander certainly liked connecting dots, but seemed unconcerned if the resulting picture was completely unintelligible.
Under Alexander's leadership, one of the agency's signature analysis tools was a digital graph that showed how hundreds, sometimes thousands, of people, places, and events were connected to each other. They were displayed as a tangle of dots and lines. Critics called it the BAG -- for "big ass graph" -- and said it produced very few useful leads.
When you have tons of data, you have to filter out the noise if you're going to use it in any meaningful way. Alexander may have learned from the previous experience that while many terrorists may purchase pizzas, not everyone who purchases pizza is a terrorist. Hence the first level of "auditing," as Marcy Wheeler points out at emptywheel.
As I noted last month, the NSA’s primary order for the Section 215 program allows for technical personnel to access the data, in unaudited form, before the analysts get to it. They do so to identify “high volume identifiers” (and other “unwanted BR metadata”). As I said, I suspect they’re stripping the dataset of numbers that would otherwise distort contact chaining.
Separating the signal from the noise is the first step for working with any large data set. But the NSA's separation step operates under the assumption that every number with an inordinate number of hits is just noise. If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists. Wheeler goes back through the series of missed connections by intelligence and law enforcement agencies that were uncovered after the Boston bombing.
I suspect a lot of what these technical personnel are doing is stripping numbers — probably things like telemarketer numbers — that would otherwise distort the contact chaining... I used telemarketers, but Alexander himself has used the example of the pizza joint in testimony.
In other words, it appears Alexander learned from his mistake at INSCOM that pizza joints do not actually represent a meaningful connection. His use of the example seems to suggest that NSA now strips pizza joints from their dataset.
I also suspect there may be one gaping hole in the NSA’s data relating to the Tsarnaevs: any calls and connections through Gerry’s Italian Kitchen.
Gerry’s was, if you recall, the pizza joint involved in the 2011 murder in Waltham: the three men were killed sometime between ordering a pizza and its delivery 45 minutes later. I’ve been told both Tsarnaevs had delivered pizza for that restaurant before then and Tamerlan may still have been.
But Gerry’s is also where the brothers disposed of some of their explosives the night of the manhunt, and it may well have been what brought them to Watertown.
So a connection to the brothers going back years when they worked there, a connection to the 2011 murder, and a connection (however tangential) to the manhunt. Yet (I’m guessing here) any ties the brothers had through that pizza joint would not show up in the dragnet collected precisely for that purpose, because such data is purged because normally pizza joints don’t reflect a meaningful relationship.
Here's where the NSA's collection activities become a damned-if-you-do, damned-if-you-don't situation. Leave the pizza places in and everyone is linked to terrorists. Take them out and you delete helpful connections. The agency will probably point to the need to access more data, in order to somehow further filter the previously collected data. It has most likely already devoted several million dollars towards solving this conundrum -- more analysts, more tools, more data. The one thing it hasn't considered, apparently, is the simplest solution: targeted collections.
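For the curious, the pizza-shop trade-off is easy to reproduce on a toy call graph. What follows is only a rough sketch, not a description of anything the NSA actually runs: the call records, the names (`suspect`, `associate`, `pizza_shop`), the two-hop limit, and the `max_contacts` cutoff for what counts as a "high volume identifier" are all invented for illustration.

```python
from collections import defaultdict, deque

def build_graph(calls):
    """Turn (caller, callee) pairs into an undirected contact graph."""
    graph = defaultdict(set)
    for a, b in calls:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def strip_high_volume(graph, max_contacts):
    """Drop every number with more than max_contacts distinct contacts.

    The cutoff is a made-up stand-in for whatever rule marks a
    "high volume identifier"; the real criteria are not public.
    """
    hubs = {n for n, nbrs in graph.items() if len(nbrs) > max_contacts}
    return {n: nbrs - hubs for n, nbrs in graph.items() if n not in hubs}

def chain(graph, seed, hops):
    """Contact chaining: everyone reachable from seed within `hops` calls."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen - {seed}

# Hypothetical call records: one busy pizza shop, one private call.
calls = [
    ("suspect", "pizza_shop"),
    ("associate", "pizza_shop"),
    ("customer_1", "pizza_shop"),
    ("customer_2", "pizza_shop"),
    ("customer_3", "pizza_shop"),
    ("suspect", "cousin"),
]

full = build_graph(calls)
filtered = strip_high_volume(full, max_contacts=3)

with_hub = chain(full, "suspect", hops=2)         # every customer is "connected"
without_hub = chain(filtered, "suspect", hops=2)  # the associate vanishes too
```

With the pizza shop left in, two hops from `suspect` sweeps in every customer alongside the one meaningful contact. With it stripped out, the customers disappear, but so does `associate`, whose only tie to the suspect ran through the shop.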
[B]ecause this was a dragnet, rather than a collection of the brothers’ calls, this pizza connection may have been hidden entirely in the data.
The continuous, ever-increasing flow of data into the NSA's haystacks has just as much of a chance to bury useful connections as it has to bring them to light. Intelligence agencies don't care much for targeted data acquisition, preferring to pick it up in bulk "just in case."
It's as though the collection of data is its own end. I suppose the only "fortunate" aspect of this dragnet is that it's occurring in a digital age, thus keeping the NSA's data centers from looking like interior shots of a particularly horrific episode of "Hoarders." The theory is that this will prevent terrorist attacks. But in practice, it keeps looking as if our intelligence agencies could be just as ineffective with half the data.