The Problem With Too Much Data: Mistaking The Signal For The Noise

from the quantity-over-quality dept

The NSA can't get enough data, as is evidenced by its shiny, new data center and its multiple efforts to either bypass laws entirely or have them rewritten in its favor. General Alexander, in particular, wants all the data. Everything. And as Mike covered earlier, he's not shy about grabbing the data first and worrying about the legality later.

In his enthusiastic pursuit of more data, Alexander seems to have skipped any sort of confirmation that adding more data actually helps. Here's one issue the indiscriminate data harvesting raised.

“He had all these diagrams showing how this guy was connected to that guy and to that guy,” says a former NSA official who heard Alexander give briefings on the floor of the Information Dominance Center. “Some of my colleagues and I were skeptical. Later, we had a chance to review the information. It turns out that all [that] those guys were connected to were pizza shops.”
Tons of noise, or rather, tons of dots, the kind intelligence leaders seem to believe we're still short on. Alexander certainly liked connecting dots, but seemed unconcerned that the resulting picture was completely unintelligible.
Under Alexander's leadership, one of the agency's signature analysis tools was a digital graph that showed how hundreds, sometimes thousands, of people, places, and events were connected to each other. They were displayed as a tangle of dots and lines. Critics called it the BAG -- for "big ass graph" -- and said it produced very few useful leads.
When you have tons of data, you have to filter out the noise if you're going to use it in any meaningful way. Alexander may have learned from the previous experience that while many terrorists may purchase pizzas, not everyone who purchases pizza is a terrorist. Hence the first level of "auditing," as Marcy Wheeler points out at emptywheel.
As I noted last month, the NSA’s primary order for the Section 215 program allows for technical personnel to access the data, in unaudited form, before the analysts get to it. They do so to identify “high volume identifiers” (and other “unwanted BR metadata”). As I said, I suspect they’re stripping the dataset of numbers that would otherwise distort contact chaining.

I suspect a lot of what these technical personnel are doing is stripping numbers — probably things like telemarketer numbers — that would otherwise distort the contact chaining... I used telemarketers, but Alexander himself has used the example of the pizza joint in testimony.

In other words, it appears Alexander learned from his mistake at INSCOM that pizza joints do not actually represent a meaningful connection. His use of the example seems to suggest that NSA now strips pizza joints from their dataset.
Separating the signal from the noise is the first step for working with any large data set. But the NSA's separation step operates under the assumption that every number with an inordinate number of hits is just noise. If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists. Wheeler goes back through the series of missed connections by intelligence and law enforcement agencies that were uncovered after the Boston bombing.
I also suspect there may be one gaping hole in the NSA’s data relating to the Tsarnaevs: any calls and connections through Gerry’s Italian Kitchen.

Gerry’s was, if you recall, the pizza joint involved in the 2011 murder in Waltham: the three men were killed sometime between ordering a pizza and its delivery 45 minutes later. I’ve been told both Tsarnaevs had delivered pizza for that restaurant before then and Tamerlan may still have been.

But Gerry’s is also where the brothers disposed of some of their explosives the night of the manhunt, and it may well have been what brought them to Watertown.

So a connection to the brothers going back years when they worked there, a connection to the 2011 murder, and a connection (however tangential) to the manhunt. Yet (I’m guessing here) any ties the brothers had through that pizza joint would not show up in the dragnet collected precisely for that purpose, because such data is purged because normally pizza joints don’t reflect a meaningful relationship.
Here's where the NSA's collection activities become a damned-if-you-do, damned-if-you-don't situation. Leave the pizza places in and everyone is linked to terrorists. Take them out and you delete helpful connections. The agency will probably point to the need to access more data, in order to somehow further filter the previously collected data. It has most likely already devoted several million dollars towards solving this conundrum -- more analysts, more tools, more data. The one thing it hasn't considered, apparently, is the simplest solution: targeted collections.
[B]ecause this was a dragnet, rather than a collection of the brothers’ calls, this pizza connection may have been hidden entirely in the data.
The continuous, ever-increasing flow of data into the NSA's haystacks has just as much of a chance to bury useful connections as it has to bring them to light. Intelligence agencies don't care much for targeted data acquisition, preferring to pick it up in bulk "just in case."
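
To make that trade-off concrete, here is a minimal Python sketch of contact chaining on a toy call graph. Nothing here reflects the NSA's actual tooling: the record format, the `threshold` cutoff, and every name (`build_graph`, `high_volume_identifiers`, `contact_chain`, `pizza_shop`, and so on) are assumptions made up for illustration.

```python
from collections import deque


def build_graph(call_records):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = {}
    for a, b in call_records:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def high_volume_identifiers(graph, threshold):
    """Return identifiers with at least `threshold` distinct contacts.

    The cutoff is an illustrative assumption, not a known NSA rule.
    """
    return {node for node, peers in graph.items() if len(peers) >= threshold}


def contact_chain(graph, seed, max_hops, excluded=frozenset()):
    """Breadth-first search: everyone within `max_hops` of `seed`,
    never landing on or passing through an excluded identifier."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for peer in graph.get(node, ()):
            if peer not in seen and peer not in excluded:
                seen.add(peer)
                queue.append((peer, hops + 1))
    return seen - {seed}


# Hypothetical records: two suspects and four unrelated customers
# all call the same pizza shop; suspect_a also calls one friend.
records = [("suspect_a", "pizza_shop"), ("suspect_b", "pizza_shop")]
records += [(f"customer_{i}", "pizza_shop") for i in range(4)]
records.append(("suspect_a", "friend"))

graph = build_graph(records)
noisy = high_volume_identifiers(graph, threshold=5)

# Damned if you do: leaving the shop in links suspect_a to every customer.
unfiltered = contact_chain(graph, "suspect_a", max_hops=2)

# Damned if you don't: stripping the shop also drops the link to suspect_b.
filtered = contact_chain(graph, "suspect_a", max_hops=2, excluded=noisy)
```

With the pizza shop left in, two hops from `suspect_a` sweep up every customer along with `suspect_b`. With it stripped out as a high-volume identifier, the only remaining contact is `friend`, and the one path to `suspect_b` is gone.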

It's as though the collection of data is its own end. I suppose the only "fortunate" aspect of this dragnet is that it's occurring in a digital age, thus keeping the NSA's data centers from looking like interior shots of a particularly horrific episode of "Hoarders." The theory is that this will prevent terrorist attacks. But in practice, it keeps looking as if our intelligence agencies could be just as ineffective with half the data.


Reader Comments


  • Ninja, 10 Sep 2013 @ 7:59am

    Which means that I can easily disguise my connections by opening a pizza shop and using it as my Al-Qaeda franchise while selling pizza to unsuspecting innocent citizens. See, use the noise to disguise yourself.

    Next in the news: FBI marks pizza shops as probable terrorism dens. Along with mosques and pet shops.

  • DCL, 10 Sep 2013 @ 9:00am

    Inspiration..

    Somebody high up in the NSA watched the movie A Beautiful Mind and thought the shed was "soo cool" and that it was a good model for showing connections.

    • Rikuo, 10 Sep 2013 @ 11:00am

      Re: Inspiration..

      I wonder, though: would it explain a lot about the NSA's behaviour if, like the dramatised version of John Nash, the analysts just stare at a wall of dots for hours on end to try and find the commie nukes...I mean Al-Qaeda terrorists?

  • Anonymous Coward, 10 Sep 2013 @ 9:03am

    The NSA & too much data

    What the NSA really needs is an Enterprise Version 12 Step Program for Data Addiction

  • Mega1987, 10 Sep 2013 @ 9:14am

    graphing the relationship...

    Looks like NSA is attempting to make the Biggest Relationship diagram ever made in human history...

    With over 6 BILLION people to cross-reference and relate to each other...

    well... you'd be better off analysing a relationship diagram of Ranma 1/2 or Negima! than what the NSA is planning to make...

    But one thing is for sure.... a lot of people will have an "Annoyed at" relationship with anyone at the NSA right now over this mass relationship diagram making...

    They should have tried graphing their own ancestry instead of this...

  • Anonymous Coward, 10 Sep 2013 @ 9:32am

    Wonderful News!

    Our privacy is being protected not by law or reason, but because an agency dedicated to finding needles is building bigger and bigger haystacks!

  • Alt0, 10 Sep 2013 @ 9:34am

    No Restaurants?

    Well, how naive. Organized crime over the years has depended on just this type of place to conduct nefarious operations.
    The "mob" used Italian restaurants exclusively! I guess no one in the NSA watched The Sopranos.

  • radarmonkey, 10 Sep 2013 @ 9:49am

    Statistics 101, people!

    Repeat after me: "Correlation does not imply causation!"

  • Anonymous Coward, 10 Sep 2013 @ 9:55am

    Six Degrees of Separation

    The whole point of the small-world network model was to point out that it is easy to connect any two people through a small number of person-to-person relationships. However, unless a direct contact can be shown between two people, even having a contact in common does not mean they have any contact with each other, or any common goals.
    Keith Alexander does not appear to appreciate this, and this makes his approach to using data very, very dangerous.

    • Lord Binky, 10 Sep 2013 @ 11:50am

      Re: Six Degrees of Separation

      The majority of people can be connected within 2-3 degrees of separation. It isn't useful unless you limit the form of contact or the connection between people. Otherwise it is a novelty.

  • EvilBill, 10 Sep 2013 @ 10:00am

    Next on Hoarders...

  • Uriel-238, 10 Sep 2013 @ 10:34am

    The Pizza Connection is real.

    You don't get it, man. It is the pizzerias. IT'S THE PIZZERIAS!

  • Kal Zekdor, 10 Sep 2013 @ 11:02am

    Not sure exactly what method the NSA is using, but you can't just look at people who have a shallow connection to a known criminal and expect to get any meaningful results.

    Even direct, in-person connections are usually completely innocuous, e.g. a neighbor, college roommate, brother or sister, etc.

    At best you may find a "potential" criminal by cross-checking connectivity maps of two or more known criminals (especially those who aren't directly connected to each other). Someone who is closely connected (within 2 jumps, not 3) to multiple known criminals would be rather suspect, and may warrant further (non-intrusive) investigation. I say non-intrusive because we do not (ostensibly) believe in guilt by association. Innocent until proven guilty, and all that good stuff.

    Then again, I'm not sure I should be giving the NSA tips on how to use the mountains of data they've illegally (or, at least, unethically) obtained. I've always had a fascination with large data sets, though. In another reality, it's possible I could have been working for the NSA on just that sort of thing (and hopefully have had the courage to pull a Snowden).

  • Zangetsu, 10 Sep 2013 @ 11:34am

    Nintendo FTW

    Welcome to the Pokemon generation. The analysts grew up with the mentality that they needed to catch all of the pokemon in order to "win". Now they want to catch all of the data in order to "win". Anxiously looking forward to seeing what happens when the Grand Theft Auto generation is in charge.

  • Anonymous Coward, 10 Sep 2013 @ 11:55am

    /s

    What's next? Daily semen samples?

  • Lord Binky, 10 Sep 2013 @ 11:57am

    This would ALL be resolved if they were required to pass a university-level machine learning course. This kind of crap gets cleared up when you understand the limitations of either kind of classification (supervised or unsupervised). Present this guy with such classics as the Iris data set, and he would have a mental breakdown.

  • Hephaestus, 10 Sep 2013 @ 12:33pm

    "If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists."

    The patterns between any two events will be subtly different. Those subtle differences will lead to finding more than you can handle or missing what you need.

  • New Mexico Mark, 10 Sep 2013 @ 1:41pm

    Why we'll never have real artificial intelligence

    Obviously, a "find bad stuff" button is badly needed. Unfortunately, if you started inventing systems with enough intelligence to actually do this, the "bad stuff" flagged would be organizations like the NSA and politicians who create hostile/wasteful laws.

    Hmmm.... Maybe that is a key reason why AI research is stalled right now. "What were you thinking? We didn't mean REAL intelligence! Your research funding has been cut until we get more positive (to us) results."

    • New Mexico Mark, 10 Sep 2013 @ 2:03pm

      Re: Why we'll never have real artificial intelligence

      Actually, real analysts live in two worlds.

      The first world is the complex, scary world of big data where every "conclusion" is actually just the first piece of a larger puzzle and must be tested several different ways before being given any credibility. Almost everything is just shades of gray.

      The second world is that of producing pictures and useless "find bad stuff instantly" tools for executives and tourists who can't be troubled to think and won't accept that this is an impossible goal. The best outcome a good analyst can hope for is that no one treats the pictures or buttons as actionable information.

  • Uriel-238, 10 Sep 2013 @ 4:48pm

    Fixed

    Why we'll never have magical artificial intelligence.

    In strategy games, playing against computer AI, I noticed that if I built a closed castle, the opposing armies would bring siege engines to breach the gates or the walls (whichever was weakest). Yet, if I left the gates open and turned the courtyard into a killzone, they'd happily rush their armies in to get mulched.

    It turns out that humans (real intelligence) often make this mistake as well.

    Similarly, we'll never have artificial intelligence that can discern terrorist activity from benign communication because real intelligence cannot agree which is which, much in the way judges cannot discern when erotic artistic media ends and porn begins.

    The Daily Telegraph's crossword compiler was questioned by British intelligence in 1944 for (coincidentally) putting too many code words from Operation Overlord into his puzzles.

  • Anonymous Coward, 11 Sep 2013 @ 1:04am

    The spying isn't about catching terrorists. The spying is about attempting to control the world through political blackmail, corporate espionage, and oppressing dissident movements.

    Refer to the spying on Brazil's president, Brazil's largest oil corporation, Bradley Manning, and Edward Snowden's exile in Russia as proof of all three.

    There are many more examples, of course. Edward Snowden understood what the oppressive global spying apparatus is really about.

    He did his best to steer humanity away from its corrupt iron grip. We should attempt to do the same. Otherwise freedom will be lost, possibly forever.

  • Ajaxn, 17 Feb 2017 @ 9:07am

    Time and Timing

    Here's a thought.

    Everyone in this thread could be seen as connected, even though we've never met, or spoken to each other. We now share this meta connection.

    Ditto everyone who has ever clicked a URL, say as an entry in the mother of all meta nodes, Google search, to read an article published by this web site.

    Then there would be our online 'Trolls' to consider. Trolls could be seen as 'meta nodes'. Anyone who has ever encountered these online trolls will know they often have very wide agendas. Which means as nodes, anyone they target could be treated as if they were connected, just by virtue of who they have in common.

    Meantime these trolls as meta nodes and functionaries of this system, would remain invisible as the cause or context for those connections.

    Connections alone won't tell the whole story; you would have to look at the frequency of those connections, as well as their context, in order to judge the relevance of that information.

    As is often the case, we define x by what we seek. That limited set of attributes defining x could mean we fail to see other aspects of the information which might contradict our conclusions. In other words, our answers are only as good as the questions asked, which are only as good as the attributes of information recorded. A lot of data doesn't mean a lot of useful data. Or, put another way, sometimes you want that data to include information which allows you to exclude a particular result.
