One More Time With Feeling: 'Anonymized' User Data Not Really Anonymous
from the we-can-see-you dept
As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is "anonymized," or stripped of personal detail. But time and time again, we've noted how this really is cold comfort, given that it takes only a little effort to identify a person by cross-referencing other data sets. As cellular carriers in particular begin to collect every shred of browsing and location data, identifying "anonymized" data using just a little additional context has become arguably trivial.
Researchers from Stanford and Princeton universities plan to make this point once again via a new study being presented at the World Wide Web Conference in Perth, Australia, this upcoming April. According to this new study, browsing habits can be easily linked to social media profiles to quickly identify users. In fact, using data from roughly 400 volunteers, the researchers found that they could identify the person behind an "anonymized" data set more than 70% of the time just by comparing their browsing data to their social media activity:
"The programs were able to find patterns among the different groups of data and use those patterns to identify users. The researchers note that the method is not perfect, and it requires a social media feed that includes a number of links to outside sites. However, they said that "given a history with 30 links originating from Twitter, we can deduce the corresponding Twitter profile more than 50 percent of the time."
The researchers had even greater success in an experiment they ran involving 374 volunteers who submitted web browsing information. The researchers were able to identify more than 70 percent of those users by comparing their web browsing data to hundreds of millions of public social media feeds.
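To make the basic idea concrete, here's a minimal sketch of how this kind of matching could work. To be clear, this is not the researchers' actual method; it's an illustrative scoring scheme built on assumptions: the function name `rank_candidates`, the rarity weighting, and the toy data are all hypothetical. It simply scores each candidate profile by how many links from its feed show up in an "anonymized" browsing history, giving more weight to links that few feeds share:

```python
from collections import Counter
from math import log


def rank_candidates(history, feeds):
    """Rank candidate profiles by overlap between a browsing history and each feed.

    history: iterable of visited URLs (the "anonymized" data).
    feeds:   dict mapping a profile name to the list of URLs shared in its feed.
    Returns profile names sorted from most to least likely match.
    """
    # Count how many feeds contain each link; rarer links are more identifying.
    link_counts = Counter(link for links in feeds.values() for link in set(links))
    n_feeds = len(feeds)
    visited = set(history)

    scores = {}
    for profile, links in feeds.items():
        overlap = visited & set(links)
        # Assumed weighting: a link seen in fewer feeds contributes more.
        scores[profile] = sum(log(1 + n_feeds / link_counts[link]) for link in overlap)

    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented purely for illustration.
feeds = {
    "alice": ["a.example/1", "b.example/2", "news.example/x"],
    "bob": ["news.example/x", "c.example/3"],
    "carol": ["d.example/4"],
}
history = ["a.example/1", "news.example/x", "z.example/9"]

ranking = rank_candidates(history, feeds)
print(ranking)
```

Even this crude version shows why a history containing a few dozen links from a feed is so revealing: each rarely shared link sharply narrows the pool of plausible profiles.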
Of course, given the sophistication of online tracking and behavioral ad technology, this shouldn't be particularly surprising. Numerous researchers have likewise noted that it's relatively simple to build systems that identify users with just a little additional context. That, of course, raises questions about how much protection "anonymizing" data actually provides, both in everyday business practice and should this data be hacked and released into the wild:
"Yves-Alexandre de Montjoye, an assistant professor at Imperial College London, said the research shows how "easy it is to build a full-scale 'de-anonymizationer' that needs nothing more than what's available to anyone who knows how to code." "All the evidence we have seen piling up over the years showing the strong limits of data anonymization, including this study, really emphasizes the need to rethink our approach to privacy and data protection in the age of big data," said de Montjoye.
And this doesn't even factor in how new technologies -- like Verizon's manipulation of user data packets -- allow companies to build sophisticated new profiles based on the combination of browsing data, location data, and modified packet headers. The FCC's recently passed broadband privacy rules were designed in part to acknowledge these new efforts by allowing user data collection -- but only if this data was "not reasonably linkable" to individual users. But once you realize that all data -- "anonymized" or not -- is linkable to individual users, such a distinction becomes wholly irrelevant.
One of the study's authors, Princeton researcher Arvind Narayanan, has been warning for the better part of the last decade that anonymous data isn't really anonymous, yet it's not entirely clear when we intend to actually hear -- and understand -- his message.