Harvard Students Again Show 'Anonymized' Data Isn't Really Anonymous

from the I-know-more-about-you-than-you-do dept

As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is "anonymized" -- or stripped of personal identifiers like social security numbers. But time and time again, studies have shown how this really is cold comfort, given it takes only a little effort to pretty quickly identify a person based on access to other data sets. Yet most companies, many privacy policy folk, and even government officials still like to act as if "anonymizing" your data means something.

A pair of Harvard students have once again highlighted that it very much doesn't.

As part of a class study, two Harvard computer scientists built a tool to analyze the thousands of data sets leaked over the last five years or so, ranging from the 2015 hack of Experian to the countless other privacy scandals that have plagued everyone from social media giants to porn websites. Their tool collected and analyzed all this data, and matched records across scandals using the email addresses they had in common. What they found, again (surprise!), is that anonymized data is in no way anonymous:

"An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories."

“We showed that an ‘anonymized’ dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets,” Metropolitansky said. “So we shouldn’t assume that our personal information is safe just because a company claims to limit how much they collect and store."
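The kind of linkage Metropolitansky describes can be sketched in a few lines of Python. This is a minimal illustration, not the students' actual tool: the toy records, the field names (`email_hash`, `purchases`, `name`), and the `link_records` helper are all hypothetical.

```python
# Hypothetical toy data: neither list comes from a real leak.
anonymized_leak = [
    {"email_hash": "a1f3", "purchases": ["prenatal vitamins"]},
    {"email_hash": "9c2e", "purchases": ["garden hose"]},
]
identified_leak = [
    {"email_hash": "a1f3", "name": "Alice Example"},
    {"email_hash": "77b0", "name": "Bob Example"},
]


def link_records(anonymized, identified, key):
    """Join two datasets on a column that appears in both."""
    # Index the identified dataset by the shared column for fast lookup.
    index = {row[key]: row for row in identified}
    linked = []
    for row in anonymized:
        match = index.get(row[key])
        if match is not None:
            # Merge the two rows: the "anonymous" record now carries a name.
            linked.append({**match, **row})
    return linked


linked = link_records(anonymized_leak, identified_leak, "email_hash")
for record in linked:
    print(record["name"], "->", record["purchases"])
```

Neither dataset reveals much alone, but the one shared column is enough to attach a name to the supposedly anonymous purchase history.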

For example, one UK study showed how machine learning could correctly identify 99.98% of Americans in an anonymized data set using just 15 characteristics. Another MIT study of "anonymized" credit card user data showed how users could be identified 90% of the time using just a few points of information. One German study (pdf) looked at how just 15 minutes of brake pedal data could help researchers identify the right driver, out of 15 potential options, 90% of the time.
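Why a small number of attributes can be enough is easy to see by counting how many records are unique on a given combination of columns. The following is a rough sketch with made-up records; the `unique_fraction` helper and the column names are hypothetical and are not drawn from any of the studies mentioned above.

```python
from collections import Counter

# Hypothetical records with no names, only "harmless" attributes.
records = [
    {"zip": "02138", "birth_year": 1990, "gender": "F"},
    {"zip": "02138", "birth_year": 1990, "gender": "M"},
    {"zip": "02138", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "F"},
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values on `columns` match no other row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    unique = sum(
        1 for row in rows if counts[tuple(row[c] for c in columns)] == 1
    )
    return unique / len(rows)


one = unique_fraction(records, ["zip"])
three = unique_fraction(records, ["zip", "birth_year", "gender"])
print(f"unique on zip alone: {one:.0%}")
print(f"unique on zip + birth year + gender: {three:.0%}")
```

In this toy set nobody is unique on ZIP code alone, but most rows become unique once three attributes are combined. Each extra column shrinks the crowd a record can hide in.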

Take that data and fuse it with, say... the location data hoovered up by your cell phone provider, or the smart electricity meter data collected by your local power utility, and it's possible for a hacker, researcher, or corporation to build the kind of detailed profile of your daily movements and habits that even you or your spouse might be surprised by. And since we still don't have even a basic U.S. privacy law for the internet era, nothing really seems to change, and any penalties for abusing the public trust are, well, routinely pathetic.

Yet somehow, every time there's another massive new hack or breach, the companies involved (as we just saw with the Avast antivirus privacy scandal) like to downplay the threat by insisting the data collected was anonymized, and therefore there's just no way it could help specifically identify or target individuals. There's simply never been any indication that's actually true.

Filed Under: anonymity, anonymous data, study


Reader Comments


  • identicon
    Anonymous Coward, 10 Feb 2020 @ 6:33am

    My phone sleeps with my wife's phone.

  • icon
    That Anonymous Coward (profile), 10 Feb 2020 @ 6:37am

    I am starting to think they do not know the meaning of the word anonymized.

    "but hackers have long memories"
    They also have lots of storage space & a need to keep things that might be useful in the future.
    I mean it's not like I still have that entire dump of ACS when they screwed up and put the whole server backup online.... er wait.

    The world seems to keep working under the assumption, no one would ever do that.
    No one would ever combine datasets.
    No one would ever scrape every picture they could find.
    No one would ever give the names, numbers, emails of everyone they know to a platform.
    No one would ever build shadow profiles to help find more links between people.
    No one would ever use shopping data to send baby coupons.
    No one would ever lie online.

    Humanity... The Good Intentions of: No one would ever...

  • icon
    jdesa (profile), 10 Feb 2020 @ 7:04am

    Companies running the internet all became billionaires because they hoover up our personal data. The government that could fix this is currently run by money & not the public interest.

    Glad we at least have the GDPR.

  • identicon
    Anonymous Coward, 10 Feb 2020 @ 7:19am

    One German study (pdf) looked at how just 15 minutes of brake pedal data could help them identify the right driver, out of 15 potential options, 90% of the time.

    Sure, but there aren't only 15 drivers out there. Any reasonable sampling of drivers you would want to search through for an individual would include hundreds if not thousands of initial members of the pool.

    • identicon
      Anonymous Coward, 10 Feb 2020 @ 7:25am

      Re:

      Yes - and in addition, I doubt that the variations in one driver's brake pedal application are unique enough to pick that driver out of millions.

      • identicon
        Anonymous Coward, 10 Feb 2020 @ 1:26pm

        Re: Re:

        Ah; but that's where the point of this (and similar) article comes in.

        Using the brake pedal data, you segment the dataset of 30 million down into buckets of, say, 10,000.

        Now for each of those buckets, you segment by acceleration data, creating unique buckets of size 10.

        The chance that there would be a GPS location or ALPR location collision between those 10 people is extremely low. Meaning you can now fingerprint not just the vehicle and backtrace where it went over a period of time, you know with a very high degree of certainty who was driving that vehicle. All without any visual confirmation of the face behind the wheel.

        • identicon
          Anonymous Coward, 10 Feb 2020 @ 1:32pm

          Re: Re: Re:

          Imagine, for example, recovering a stolen vehicle, dumping the telemetry, and with only 3 factors or so being able to ID who stole the vehicle and what they did with it.

          There's no central database of vehicle telemetry, so that's not currently possible -- but if you're an insurance agency with wide coverage and favourable terms for people who share their telemetry... you'll get an idea pretty quickly.

          Also: if a car is insured for pleasure use only and shares telemetry, it becomes obvious pretty quickly if extra people who aren't on the insurance are using the vehicle, even if the car meets the distance criteria.

          • identicon
            Anonymous Coward, 10 Feb 2020 @ 2:32pm

            Re: Re: Re: Re:

            And now imagine becoming a serf for the corporations, because they control almost everything you need to live as part of society.

    • identicon
      teka, 10 Feb 2020 @ 7:48am

      Re:

      Okay, but let's pick just the cars that leave your work parking lot between 5 and 5:30, then slow and brake to a stop for a few minutes at the same time and on the same days you happen to stop in at 7-Eleven for a post-work Slurpee according to your credit card data, and then drive far enough (time or distance) to reach your home minutes before your smart power meter shows electricity consumption increasing as you turn on your oven to start preheating for dinner. Keep plucking out data points that could escape and you fill in more and more gaps.

  • icon
    hij (profile), 10 Feb 2020 @ 9:08am

    the power of correlation

    Correlation may not imply causation, but the associations are enough for google to earn billions per year.

    • identicon
      Anonymous Coward, 10 Feb 2020 @ 10:35am

      Re: the power of correlation

      A small correction: the marketers' belief in correlation and profiling allows Google to earn billions per year.

  • identicon
    Anonymous Coward, 10 Feb 2020 @ 9:09am

    we all know

    that online anonymity is a fallacy. It's all about control.
    Is tech moving toward or away from allowing one person in power to monitor almost everyone?

  • icon
    That One Guy (profile), 10 Feb 2020 @ 10:41am

    And yet...

    I all but guarantee you that the people/companies putting forth the 'it's harmless data collection, it's been anonymized' line would refuse point-blank were someone to ask them to provide their own 'anonymized' data to pore through.

    They know damn well the 'we can't identify people with this data' excuse is a lie, they're just hoping that the people they're talking to don't know that, or have a vested interest in perpetuating the lie.

    • identicon
      anny, 10 Feb 2020 @ 1:43pm

      Re: And yet...

      Actually, one reporter did that, and the researchers were able to determine, without his assistance:
      1) where he worked
      2) where he lived
      3) where he got gas
      4) where he shopped for groceries
      5) that he was married or in a relationship
      and the list of things they were able to determine went on and on.

    • identicon
      Anonymous Coward, 11 Feb 2020 @ 3:30am

      Re: And yet...

      Indeed.

      Similarly, people who advocate for war should have their children fight on the front lines in said war.

  • identicon
    Anonymous Coward, 10 Feb 2020 @ 12:15pm

    The anonymized data problem stems from an internet backbone that was designed to not be anonymous.
