The Problem With Too Much Data: Mistaking The Signal For The Noise

from the quantity-over-quality dept

The NSA can't get enough data, as is evidenced by its shiny, new data center and its multiple efforts to either bypass laws entirely or have them rewritten in its favor. General Alexander, in particular, wants all the data. Everything. And as Mike covered earlier, he's not shy about grabbing the data first and worrying about the legality later.

In his enthusiastic pursuit of more data, Alexander seems to have bypassed any sort of confirmation that adding more data is helpful. Here's one issue the indiscriminate data harvesting raised.

“He had all these diagrams showing how this guy was connected to that guy and to that guy,” says a former NSA official who heard Alexander give briefings on the floor of the Information Dominance Center. “Some of my colleagues and I were skeptical. Later, we had a chance to review the information. It turns out that all [that] those guys were connected to were pizza shops.”
Tons of noise, or rather, tons of dots, the kind intelligence leaders seem to believe we're still short on. Alexander certainly liked connecting dots, but seemed unconcerned about whether the resulting picture was completely unintelligible.
Under Alexander's leadership, one of the agency's signature analysis tools was a digital graph that showed how hundreds, sometimes thousands, of people, places, and events were connected to each other. They were displayed as a tangle of dots and lines. Critics called it the BAG -- for "big ass graph" -- and said it produced very few useful leads.
When you have tons of data, you have to filter out the noise if you're going to use it in any meaningful way. Alexander may have learned from the previous experience that while many terrorists may purchase pizzas, not everyone who purchases pizza is a terrorist. Hence the first level of "auditing," as Marcy Wheeler points out at emptywheel.
As I noted last month, the NSA’s primary order for the Section 215 program allows for technical personnel to access the data, in unaudited form, before the analysts get to it. They do so to identify “high volume identifiers” (and other “unwanted BR metadata”). As I said, I suspect they’re stripping the dataset of numbers that would otherwise distort contact chaining.

I suspect a lot of what these technical personnel are doing is stripping numbers — probably things like telemarketer numbers — that would otherwise distort the contact chaining... I used telemarketers, but Alexander himself has used the example of the pizza joint in testimony.

In other words, it appears Alexander learned from his mistake at INSCOM that pizza joints do not actually represent a meaningful connection. His use of the example seems to suggest that NSA now strips pizza joints from their dataset.
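
To make the stripping step concrete, here is a minimal sketch of contact chaining with and without a "high volume identifier" filter. Nothing here reflects the NSA's actual tooling: the function names, the `max_contacts` threshold, and the toy call records are all assumptions made up for illustration.

```python
from collections import defaultdict, deque


def build_graph(call_records):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = defaultdict(set)
    for a, b in call_records:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def strip_high_volume(graph, max_contacts):
    """Drop every identifier with more than max_contacts distinct contacts.

    max_contacts is a hypothetical threshold, not a documented value.
    """
    hubs = {node for node, nbrs in graph.items() if len(nbrs) > max_contacts}
    return {node: nbrs - hubs for node, nbrs in graph.items() if node not in hubs}


def contact_chain(graph, seed, max_hops):
    """Breadth-first search: everyone reachable within max_hops of seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for nbr in graph.get(node, ()):
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    del hops[seed]
    return set(hops)


# Toy data (invented): one pizza shop called by four otherwise unrelated people.
records = [
    ("suspect", "pizza"),
    ("alice", "pizza"),
    ("bob", "pizza"),
    ("carol", "pizza"),
    ("suspect", "associate"),
]
graph = build_graph(records)
filtered = strip_high_volume(graph, max_contacts=3)

print(sorted(contact_chain(graph, "suspect", 2)))
# ['alice', 'associate', 'bob', 'carol', 'pizza']
print(sorted(contact_chain(filtered, "suspect", 2)))
# ['associate']
```

With the pizza shop left in, every customer lands within two hops of the suspect. With it stripped, only the direct contact remains, along with the loss of anything that genuinely ran through the shop.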
Separating the signal from the noise is the first step for working with any large data set. But the NSA's separation step operates under the assumption that every number with an inordinate number of hits is just noise. If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists. Wheeler goes back through the series of missed connections by intelligence and law enforcement agencies that were uncovered after the Boston bombing.
I also suspect there may be one gaping hole in the NSA’s data relating to the Tsarnaevs: any calls and connections through Gerry’s Italian Kitchen.

Gerry’s was, if you recall, the pizza joint involved in the 2011 murder in Waltham: the three men were killed sometime between ordering a pizza and its delivery 45 minutes later. I’ve been told both Tsarnaevs had delivered pizza for that restaurant before then and Tamerlan may still have been.

But Gerry’s is also where the brothers disposed of some of their explosives the night of the manhunt, and it may well have been what brought them to Watertown.

So a connection to the brothers going back years when they worked there, a connection to the 2011 murder, and a connection (however tangential) to the manhunt. Yet (I’m guessing here) any ties the brothers had through that pizza joint would not show up in the dragnet collected precisely for that purpose, because such data is purged because normally pizza joints don’t reflect a meaningful relationship.
Here's where the NSA's collection activities become a damned-if-you-do, damned-if-you-don't situation. Leave the pizza places in and everyone is linked to terrorists. Take them out and you delete helpful connections. The agency will probably point to the need to access more data, in order to somehow further filter the previously collected data. It has most likely already devoted several million dollars towards solving this conundrum -- more analysts, more tools, more data. The one thing it hasn't considered, apparently, is the simplest solution: targeted collections.
[B]ecause this was a dragnet, rather than a collection of the brothers’ calls, this pizza connection may have been hidden entirely in the data.
The continuous, ever-increasing flow of data into the NSA's haystacks has just as much of a chance to bury useful connections as it has to bring them to light. Intelligence agencies don't care much for targeted data acquisition, preferring to pick it up in bulk "just in case."

It's as though the collection of data is its own end. I suppose the only "fortunate" aspect of this dragnet is that it's occurring in a digital age, thus keeping the NSA's data centers from looking like interior shots of a particularly horrific episode of "Hoarders." The theory is that this will prevent terrorist attacks. But in practice, it keeps looking as if our intelligence agencies could be just as ineffective with half the data.



Reader Comments

  •  
    Ninja (profile), Sep 10th, 2013 @ 7:59am

    Which means that I can easily disguise my connections by opening a pizza shop and using it as my Al-Qaeda franchise while selling pizza to unsuspecting innocent citizens. See, use the noise to disguise yourself.

    Next in the news: FBI marks pizza shops as probable terrorism dens. Along with mosques and pet shops.

     


    •  
      Mark Harrill (profile), Sep 10th, 2013 @ 8:41am

      Re:

      Pizza Hut, Dominos or Papa Johns?

       


    •  
      Anonymous Coward, Sep 10th, 2013 @ 9:56am

      Re:

      Clearly you are lacking in imagination. The obvious move is to mark businesses with US addresses as potential dens of terrorist infiltrators. Then you also mark all businesses with non-US addresses as terrorist dens. It's the only way to be sure you are collecting all of the data on possible terrorists.

      Also, since we are spying on people with connections to terrorists, we can't ignore the phone companies as sources of linkages. Terrorists use AT&T, AT&T has a business relationship with Verizon, therefore all Verizon customers are linked to terrorists within 3 links or fewer.

      I feel dirty just for writing that, I wish it wasn't so representative of how the government appears to be thinking.

       


  •  
    DCL, Sep 10th, 2013 @ 9:00am

    Inspiration..

    Somebody high up in the NSA watched the movie A Beautiful Mind and thought the shed was "soo cool" and that it was a good model for showing connections.

     


    •  
      Rikuo (profile), Sep 10th, 2013 @ 11:00am

      Re: Inspiration..

      I wonder, though: would it explain a lot about the NSA's behaviour if, like the dramatised version of John Nash, the analysts just stare at a wall of dots for hours on end to try and find the commie nukes...I mean Al-Qaeda terrorists?

       


  •  
    Anonymous Coward, Sep 10th, 2013 @ 9:03am

    The NSA & too much data

    What the NSA really needs is an Enterprise Version 12 Step Program for Data Addiction

     


  •  
    Mega1987 (profile), Sep 10th, 2013 @ 9:14am

    graphing the relationship...

    Looks like NSA is attempting to make the Biggest Relationship diagram ever made in human history...

    With over 6 BILLION people to cross-reference and relate to each other...

    well... it's better off to analyse a relationship diagram of Ranma 1/2 or Negima! than what NSA is planning to make...

    But one thing is for sure.... there are a lot of people who will have an "Annoyed at" relationship with anyone under NSA right now with this mass relationship diagram making...

    They should have tried graphing their own ancestry instead of this...

     


  •  
    Anonymous Coward, Sep 10th, 2013 @ 9:32am

    Wonderful News!

    Our privacy is being protected not by law or reason, but because an agency dedicated to finding needles is building bigger and bigger haystacks!

     


  •  
    Alt0, Sep 10th, 2013 @ 9:34am

    No Restaurants?

    Well how naive. Organized crime over the years has depended on just this type of place to conduct nefarious operations.
    The "mob" used Italian Restaurants exclusively! I guess no one in the NSA watched the Sopranos.

     


  •  
    radarmonkey (profile), Sep 10th, 2013 @ 9:49am

    Statistics 101, people!

    Repeat after me: "Correlation does not imply causation!"

     


  •  
    Anonymous Coward, Sep 10th, 2013 @ 9:55am

    Six Degrees of Separation

    The whole point of the small network model was to point out that it is easy to connect any two people through a small number of person-to-person relationships. However, unless a direct contact can be shown between two people, even having a contact in common does not mean they have any contact with each other, or any common goals.
    Keith Alexander does not appear to appreciate this, and this makes his approach to using data very, very dangerous.

     


    •  
      Lord Binky, Sep 10th, 2013 @ 11:50am

      Re: Six Degrees of Separation

      The majority of people can be connected within 2-3 degrees of separation. It isn't useful unless you limit the form of contact or the connection between people. Otherwise it is a novelty.

       


  •  
    EvilBill (profile), Sep 10th, 2013 @ 10:00am

    Next on Hoarders...

     


  •  
    Uriel-238 (profile), Sep 10th, 2013 @ 10:34am

    The Pizza Connection is real.

    You don't get it, man. It is the pizzerias. IT'S THE PIZZERIAS!

     


  •  
    Kal Zekdor (profile), Sep 10th, 2013 @ 11:02am

    Not sure exactly what method the NSA is using, but you can't just look at people who have a shallow connection to a known criminal and expect to get any meaningful results.

    Even direct in-person connections are usually completely innocuous, e.g. a neighbor, college roommate, brother or sister, etc.

    At best you may find a "potential" criminal by cross-checking connectivity maps of two or more known criminals (especially those who aren't directly connected to each other). Someone who is closely connected (within 2 jumps, not 3) to multiple known criminals would be rather suspect, and may warrant further (non-intrusive) investigation. I say non-intrusive because we do not (ostensibly) believe in guilt by association. Innocent until proven guilty, and all that good stuff.

    Then again, I'm not sure I should be giving the NSA tips on how to use the mountains of data they've illegally (or, at least, unethically) obtained. I've always had a fascination with large data sets, though. In another reality, it's possible I could have been working for the NSA on just that sort of thing (and hopefully have had the courage to pull a Snowden).

     


  •  
    Zangetsu (profile), Sep 10th, 2013 @ 11:34am

    Nintendo FTW

    Welcome to the Pokemon generation. The analysts grew up with the mentality that they needed to catch all of the Pokemon in order to "win". Now they want to catch all of the data in order to "win". Anxiously looking forward to seeing what happens when the Grand Theft Auto generation is in charge.

     


  •  
    Anonymous Coward, Sep 10th, 2013 @ 11:55am

    /s

    What's next? Daily semen samples?

     


  •  
    Lord Binky, Sep 10th, 2013 @ 11:57am

    This would ALL be resolved if they were required to pass a university-level machine learning course. This kind of crap gets cleared up when you understand the limitations of either kind of classification (supervised or unsupervised). Present this guy with such classics as the Iris data set, and he would have a mental breakdown.

     


  •  
    Hephaestus (profile), Sep 10th, 2013 @ 12:33pm

    "If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists."

    The patterns between any two events will be subtly different. Those subtle differences will lead to finding more than you can handle or missing what you need.

     


  •  
    New Mexico Mark, Sep 10th, 2013 @ 1:41pm

    Why we'll never have real artificial intelligence

    Obviously, a "find bad stuff" button is badly needed. Unfortunately, if you started inventing systems with enough intelligence to actually do this, the "bad stuff" flagged would be organizations like the NSA and politicians who create hostile/wasteful laws.

    Hmmm.... Maybe that is a key reason why AI research is stalled right now. "What were you thinking? We didn't mean REAL intelligence! Your research funding has been cut until we get more positive (to us) results."

     


    •  
      New Mexico Mark, Sep 10th, 2013 @ 2:03pm

      Re: Why we'll never have real artificial intelligence

      Actually, real analysts live in two worlds.

      The first world is the complex, scary world of big data where every "conclusion" is actually just the first piece of a larger puzzle and must be tested several different ways before being given any credibility. Almost everything is just shades of gray.

      The second world is that of producing pictures and useless "find bad stuff instantly" tools for executives and tourists who can't be troubled to think and won't accept that this is an impossible goal. The best outcome a good analyst can hope for is that no one treats the pictures or buttons as actionable information.

       


  •  
    Uriel-238 (profile), Sep 10th, 2013 @ 4:48pm

    Fixed

    Why we'll never have magical artificial intelligence.

    In strategy games, playing against computer AI, I noticed this thing that if I built a closed castle, the opposing armies would bring siege engines to breach the gates or the walls (whichever was weakest). Yet, if I left the gates open and turned the courtyard into a killzone, they'd happily rush their armies in to get mulched.

    It turns out that humans (real intelligence) often make this mistake as well.

    Similarly, we'll never have artificial intelligence that can discern terrorist activity from benign communication because real intelligence cannot agree which is which, much in the way judges cannot discern when erotic artistic media ends and porn begins.

    The Daily Telegraph's crossword compiler was questioned by MI5 in 1944 after too many code words from Operation Overlord turned up (coincidentally) in his puzzles.

     


  •  
    Anonymous Coward, Sep 11th, 2013 @ 1:04am

    The spying isn't about catching terrorists. The spying is about attempting to control the world through political blackmail, corporate espionage, and oppressing dissident movements.

    Refer to the spying on Brazil's president, Brazil's largest oil corporation, Bradley Manning, and Edward Snowden's exile in Russia as proof of all three.

    There are many more examples, of course. Edward Snowden understood what the oppressive global spying apparatus is really about.

    He did his best to steer humanity away from its corrupt iron grip. We should attempt to do the same. Otherwise freedom will be lost, possibly forever.

     


