Harvard Students Again Show 'Anonymized' Data Isn't Really Anonymous

from the I-know-more-about-you-than-you-do dept

As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is “anonymized” — or stripped of personal identifiers like social security numbers. But time and time again, studies have shown how this really is cold comfort, given it takes only a little effort to pretty quickly identify a person based on access to other data sets. Yet most companies, many privacy policy folk, and even government officials still like to act as if “anonymizing” your data means something.

A pair of Harvard students have once again highlighted that it very much doesn’t.

As part of a class study, two Harvard computer scientists built a tool to analyze the thousands of data sets leaked over the last five years or so, ranging from the 2015 hack of Experian to the countless other privacy scandals that have plagued everyone from social media giants to porn websites. Their tool collected and analyzed all this data, and matched it to existing email addresses across scandals. What they found, again (surprise!), is that anonymized data is in no way anonymous:

“An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories.”

“We showed that an ‘anonymized’ dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets,” Metropolitansky said. “So we shouldn’t assume that our personal information is safe just because a company claims to limit how much they collect and store.”
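The linking Metropolitansky describes can be illustrated with a minimal sketch. This is not the students' tool; the records, the column names, and the `link_on` helper are all hypothetical, and the shared column is assumed to be an email address because that is what the article says the tool matched on.

```python
# Hypothetical data: every name, column, and value is made up for illustration.
anonymized_leak = [
    {"email": "a.lee@example.com", "purchases": "sleep aids"},
    {"email": "b.kim@example.com", "purchases": "running shoes"},
]
identified_leak = [
    {"email": "a.lee@example.com", "name": "Alex Lee", "city": "Boston"},
    {"email": "c.roy@example.com", "name": "Chris Roy", "city": "Austin"},
]


def link_on(column, left, right):
    """Join two lists of records on a column that appears in both."""
    # Index the second dataset by the shared column for quick lookups.
    index = {row[column]: row for row in right}
    linked = []
    for row in left:
        match = index.get(row[column])
        if match is not None:
            # Merge the two records into one combined profile.
            linked.append({**match, **row})
    return linked


profiles = link_on("email", anonymized_leak, identified_leak)
for profile in profiles:
    print(profile)
```

Only one record overlaps in this toy data, but that one match is enough to attach a name and a city to a purchase history that carried neither.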

For example, one UK study showed how machine learning could currently identify 99.98% of Americans in an anonymized data set using just 15 characteristics. Another MIT study of “anonymized” credit card user data showed how users could be identified 90% of the time using only a few points of information. One German study (pdf) looked at how just 15 minutes of brake pedal data could be used to identify the right driver, out of 15 potential options, 90% of the time.
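To see why a handful of characteristics can single people out, here is a rough sketch that measures how many records become unique as attributes are combined. It does not reproduce the method of any of the studies above; the `unique_fraction` helper, the attribute names, and the four records are assumptions made up for illustration.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Fraction of records whose combination of attribute values appears only once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical "anonymized" records: no names, only a few ordinary attributes.
records = [
    {"zip": "02138", "birth_year": 1990, "gender": "F"},
    {"zip": "02138", "birth_year": 1990, "gender": "M"},
    {"zip": "02138", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "F"},
]

print(unique_fraction(records, ["zip"]))
print(unique_fraction(records, ["zip", "birth_year"]))
print(unique_fraction(records, ["zip", "birth_year", "gender"]))
```

In this toy set, one attribute isolates a quarter of the records, two isolate half, and three isolate all of them, even though no single attribute looks identifying on its own.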

Take that data and fuse it with, say… the location data hoovered up by your cell phone provider, or the smart electricity meter data collected by your local power utility, and it’s possible for a hacker, researcher, or corporation to build the kind of detailed profiles on your daily movements and habits that even you or your spouse might be surprised by. And since we still don’t have even a basic U.S. privacy law for the internet era, nothing really seems to change, and any penalties for abusing the public trust are, well, routinely pathetic.

Yet somehow, every time there’s another massive new hack or breach, the companies involved (as we just saw with the Avast antivirus privacy scandal) like to downplay the threat by insisting the data collected was anonymized, and that there’s therefore just no way it could help identify or target specific individuals. There’s simply never been any indication that’s actually true.

Comments on “Harvard Students Again Show 'Anonymized' Data Isn't Really Anonymous”

This comment has been deemed insightful by the community.
That Anonymous Coward (profile) says:

I am starting to think they do not know the meaning of the word anonymized.

"but hackers have long memories"
They also have lots of storage space & a need to keep things that might be useful in the future.
I mean it’s not like I still have that entire dump of ACS when they screwed up and put the whole server backup online…. er wait.

The world seems to keep working under the assumption that no one would ever do that.
No one would ever combine datasets.
No one would ever scrape every picture they could find.
No one would ever give the names, numbers, emails of everyone they know to a platform.
No one would ever build shadow profiles to help find more links between people.
No one would ever use shopping data to send baby coupons.
No one would ever lie online.

Humanity… The Good Intentions of: No one would ever…

Anonymous Coward says:

One German study (pdf) looked at how just 15 minutes of brake pedal data could help them identify the right driver, out of 15 potential options, 90% of the time.

Sure, but there aren’t only 15 drivers out there. Any reasonable sampling of drivers you would want to search through for an individual would include hundreds if not thousands of initial members of the pool.

Anonymous Coward says:

Re: Re: Re:

Ah; but that’s where the point of this (and similar) article comes in.

Using the brake pedal data, you segment the dataset of 30 million down into buckets of, say, 10,000.

Now for each of those buckets, you segment by acceleration data, creating unique buckets of size 10.

The chance that there would be a GPS location or ALPR location collision between those 10 people is extremely low. Meaning you can now fingerprint not just the vehicle and backtrace where it went over a period of time, you know with a very high degree of certainty who was driving that vehicle. All without any visual confirmation of the face behind the wheel.
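The arithmetic in the comment above can be sketched as follows. This assumes each factor splits the pool into equally sized buckets, which real telemetry would not do exactly; the `narrow` helper is hypothetical, and the bucket counts (3,000 and 1,000) are simply what the commenter's figures of 30 million, 10,000 and 10 imply.

```python
def narrow(pool_size, bucket_counts):
    """Candidates left after each factor, assuming every factor splits the pool evenly."""
    sizes = [pool_size]
    for buckets in bucket_counts:
        sizes.append(sizes[-1] / buckets)
    return sizes


# 30 million drivers, split into 3,000 brake-pedal buckets,
# then each of those split into 1,000 acceleration buckets.
sizes = narrow(30_000_000, [3_000, 1_000])
print(sizes)
```

Under that even-split assumption, two factors take the pool from 30 million to about 10 candidates, which is small enough for a third signal such as location to separate.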

Anonymous Coward says:

Re: Re: Re: Re:

Imagine, for example, recovering a stolen vehicle, dumping the telemetry, and with only 3 factors or so being able to ID who stole the vehicle and what they did with it.

There’s no central database of vehicle telemetry, so that’s not currently possible — but if you’re an insurance agency with wide coverage and favourable terms for people who share their telemetry… you’ll get an idea pretty quickly.

Also: if a car is insured for pleasure use only and shares telemetry, it becomes obvious pretty quickly if extra people who aren’t on the insurance are using the vehicle, even if the car meets the distance criteria.

teka says:

Re: Re:

Okay, but let’s pick just the cars that leave your work parking lot between 5 and 5:30, then slow and brake to a stop for a few minutes at the same time and on the same days you happen to stop in at 711 for a post-work slurpee according to your credit card data, and then drive far enough (time or distance) to reach your home minutes before your smart power meter shows electricity consumption increasing as you turn on your oven to start preheating for dinner. Keep plucking out data points that could escape and you fill in more and more gaps.

That One Guy (profile) says:

And yet...

I all but guarantee you that the people/companies putting forth the ‘it’s harmless data collection, it’s been anonymized’ line would refuse point-blank were someone to ask them to provide their ‘anonymized’ data to pore through.

They know damn well the ‘we can’t identify people with this data’ excuse is a lie, they’re just hoping that the people they’re talking to don’t know that, or have a vested interest in perpetuating the lie.
