One More Time With Feeling: 'Anonymized' User Data Not Really Anonymous

from the we-can-see-you dept

As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is “anonymized,” or stripped of personal detail. But time and time again, we’ve noted how this really is cold comfort, given that it takes only a little effort to identify a person once you have access to other data sets. As cellular carriers in particular begin to collect every shred of browsing and location data, identifying “anonymized” data using just a little additional context has become arguably trivial.

Researchers from Stanford and Princeton universities plan to make this point once again via a new study being presented at the World Wide Web Conference in Perth, Australia this upcoming April. According to this new study, browsing habits can be easily linked to social media profiles to quickly identify users. In fact, using data from roughly 400 volunteers, the researchers found that they could identify the person behind an “anonymized” data set 70% of the time just by comparing their browsing data to their social media activity:

“The programs were able to find patterns among the different groups of data and use those patterns to identify users. The researchers note that the method is not perfect, and it requires a social media feed that includes a number of links to outside sites. However, they said that ‘given a history with 30 links originating from Twitter, we can deduce the corresponding Twitter profile more than 50 percent of the time.’

“The researchers had even greater success in an experiment they ran involving 374 volunteers who submitted web browsing information. The researchers were able to identify more than 70 percent of those users by comparing their web browsing data to hundreds of millions of public social media feeds.”
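
To make the linkage concrete, here is a minimal Python sketch of the general idea: score each candidate profile by the links its feed shares with a browsing history, weighting rarely shared links more heavily. It is an illustrative stand-in, not the researchers' actual model; the function name `rank_candidates`, the rarity weighting, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def rank_candidates(history, feeds):
    """Rank candidate profiles by overlap between their feed and a browsing history.

    history: set of URLs visited by the unknown user (hypothetical input)
    feeds:   dict mapping profile name -> set of URLs seen in that profile's feed
    """
    # Count how many feeds each link appears in; rare links are more identifying.
    popularity = Counter(url for links in feeds.values() for url in links)
    total_feeds = len(feeds)

    scores = {}
    for profile, links in feeds.items():
        shared = history & links
        # Assumed weighting: a link that shows up in few feeds contributes more.
        scores[profile] = sum(
            math.log(1 + total_feeds / popularity[url]) for url in shared
        )

    # Best-matching profile first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only.
feeds = {
    "alice": {"news.example/a", "blog.example/b", "shop.example/c"},
    "bob": {"news.example/a", "video.example/d"},
    "carol": {"news.example/a", "forum.example/e"},
}
history = {"news.example/a", "blog.example/b", "shop.example/c", "mail.example/x"}

ranking = rank_candidates(history, feeds)
print(ranking[0])
```

In this toy data, "alice" ranks first because her feed shares two links with the history that appear in no other feed, while the one link everybody shares counts for little.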

Of course, given the sophistication of online tracking and behavioral ad technology, this shouldn’t be particularly surprising. Numerous researchers have likewise noted that it’s relatively simple to build systems that identify users with just a little additional context. That, of course, raises questions about how much protection “anonymizing” data actually provides, both in everyday business practice and if this data is hacked and released into the wild:

“Yves-Alexandre de Montjoye, an assistant professor at Imperial College London, said the research shows how ‘easy it is to build a full-scale “de-anonymizationer” that needs nothing more than what’s available to anyone who knows how to code.’ ‘All the evidence we have seen piling up over the years showing the strong limits of data anonymization, including this study, really emphasizes the need to rethink our approach to privacy and data protection in the age of big data,’ said de Montjoye.”

And this doesn’t even factor in how new technologies (like Verizon’s manipulation of user data packets) allow companies to build sophisticated new profiles by combining browsing data, location data, and modified packet headers. The FCC’s recently passed broadband privacy rules were designed in part to acknowledge these new efforts by allowing user data collection, but only if this data was “not reasonably linkable” to individual users. But once you realize that all data, “anonymized” or not, is linkable to individual users, such a distinction becomes wholly irrelevant.

One of the study’s authors, Princeton researcher Arvind Narayanan, has been warning that anonymous data isn’t really anonymous for the better part of the last decade, yet it’s not entirely clear when we intend to actually hear — and understand — his message.



Comments on “One More Time With Feeling: 'Anonymized' User Data Not Really Anonymous”

discordian_eris says:


The Western world is at the point now where data acquisition meets or exceeds that behind the Great Firewall of China. The only difference is that it is not the government doing the spying, it is business. Thanks to the third-party doctrine and the All Writs Act, the governments here in the US have access to all of it. And almost always without the need for something as onerous as a warrant or even probable cause.

Obama refused to crack down on these activities adequately and has handed Trump tools and weapons that no president should have. I’d say that it is up to Congress now to do their jobs, but since they work for the corporations, and not the people, that isn’t going to happen.

While I think that Obama did a number of good things, I sure wish that he had had the balls to actually heed the warnings that he was given. Like LBJ (in regard to Vietnam), Obama was too worried about being called a pussy to do the right thing about T.W.A.T. Now we all, Americans and the entire world, are going to be forced to deal with the consequences of his inaction.

I sure hope like hell that in 2020 Americans remember this kind of crap and put someone in the White House who isn’t too cowardly or psychotic to do the right thing for the country. It’s time we voted in people who KNOW that they work for the best interests of the people, not the best interests of the government, or the corporations.

Typical Business Executive says:

Everybody knows that in order to look people up in a database you start with their last name. And you start that search with the first letter of their last name. So, by removing the first letter of the last name from the data we have made it impossible to link any of the data to any particular person because we have made it impossible to look them up! Thus, we have achieved perfectly anonymized data.

Graham Cobb (profile) says:

How do we fight this?

Two possible routes (I am sure there are others):

  1. The law. Allow for data to be marked, by the owner or source, as "anonymised" (whether any technical steps are taken or not) and make it a criminal offence to either (i) attempt to de-anonymise, or (ii) correlate such data with any other data. This should be enough to prevent (for example) insurance companies using such data to set premiums and it might even be enough to prevent major commercial data brokers from using the data (although steps would have to be taken to make sure investigation and penalties are severe enough to prevent data-washing, possibly abroad). Of course, it has no effect on governments, nor on commercial deals where the source is not willing to mark the data as "anonymised".

  2. Publish standards (NIST?) for anonymisation. Maybe not so much specific algorithms as principles. For example, if identifiers are to be replaced by meaningless numbers, the identifier-to-number mapping must change more frequently than an adversary is likely to be able to gather enough data to de-anonymise. These would have to be based on research. For example, based on the research in the article, a database of tweets might need to change the mapping of the profile name every 29 tweets, or something. Or a database of ANPR data showing traffic movements might have to change the vehicle pseudo-identity every hour.

These two steps would also have to be accompanied by greater public awareness of de-anonymisation. The legal route is particularly important in making sure that companies cannot claim something is "anonymised" unless there are ways for the data subjects to actually enforce it.
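
The rotating-mapping principle in point 2 above could be sketched roughly as follows. This is only one way to do it, assuming a keyed hash (HMAC-SHA256 from Python's standard library) over the identifier plus a time-window index. The function names, the demo key, and the one-hour window (borrowed from the ANPR example in the comment) are illustrative assumptions, and rotation alone does not guarantee that records cannot be re-linked by other means.

```python
import hashlib
import hmac


def window_index(unix_seconds: int, window_seconds: int = 3600) -> int:
    """Return which fixed-length time window a timestamp falls into."""
    return unix_seconds // window_seconds


def pseudonym(secret_key: bytes, identifier: str, window: int) -> str:
    """Derive a pseudonym that stays stable within one window and changes in the next.

    Without the secret key, pseudonyms from different windows cannot be
    matched to each other by recomputing the hash.
    """
    message = f"{identifier}|{window}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:16]


# Hypothetical key and number plate, for illustration only.
key = b"demo-key-keep-this-secret"
same_hour_a = pseudonym(key, "ABC123", window_index(0))
same_hour_b = pseudonym(key, "ABC123", window_index(3599))
next_hour = pseudonym(key, "ABC123", window_index(3600))

print(same_hour_a == same_hour_b)  # sightings within one hour share a pseudonym
print(same_hour_a == next_hour)    # the pseudonym rotates in the next hour
```

The count-based variant from the comment (a new mapping every 29 tweets) would use the same `pseudonym` function with the window set to the record's position divided by 29 instead of a timestamp.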

Anonymous Coward says:

It is not just Internet information that is currently being used. Quintiles/IMS gets prescribing information from pharmacists concerning which prescriptions are filled. They then anonymize that information and then resell that information to pharmaceutical companies and anyone else that wants to buy it. Then that information is sent to other companies to append personal information to the records. The hit rate of a match is pretty good, otherwise no one would buy it. Ask Quintiles/IMS about this and you won’t get a very clear answer of exactly what they do.

historygeek (profile) says:

All of today’s major web browsers collect and give out specific information about the computer being used. Not just things like the MAC address, which is necessary for the current internet protocols, but also what your operating system is. Geographical locators are turned on by default to optimize search results and simplify mapping. And now many websites take a “portrait” of the icons on your desktop and their arrangement, as well as a list of all the programs installed on your system, which is about as unique as a fingerprint. Furthermore, a recent study showed that individual computer users could be reliably identified by the patterns formed by the routine movements they make with a computer mouse (as shown by the travel of the cursor across the screen). Unless you are making serious, consistent efforts to hide your online behaviour you are always personally identifiable. This is the standard state of affairs.
