Once More With Feeling: 'Anonymized' Data Is Not Really Anonymous
from the nothing-to-see-here dept
As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is “anonymized” or stripped of personal detail. But time and time again, we’ve noted that this is cold comfort, given that it takes only a little effort to identify a person by cross-referencing other data sets. Yet most companies (including cell phone companies that sell your location data) act as if “anonymizing” your data is iron-clad protection against it being re-identified. It’s simply not true.
The latest case in point: in new research published this week in the journal Nature Communications, data scientists from Imperial College London and UCLouvain found that it wasn’t particularly hard for companies (or anybody else) to identify the person behind “anonymized” data using other data sets. More specifically, the researchers developed a machine learning model that was able to correctly re-identify 99.98% of Americans in any anonymized dataset using just 15 characteristics, including age, gender, and marital status:
“While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog,” explained study first author Dr Luc Rocher, from UCLouvain.
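The intuition behind that quote is easy to demonstrate. The toy sketch below (not the study’s actual model; all records and attribute names are invented) shows how a few broad quasi-identifiers leave many candidates, while stacking on just a couple more narrows the pool to a single person:

```python
# Toy illustration: how a handful of quasi-identifiers can single out
# one record in an "anonymized" table. All data here is made up.

records = [
    {"age": 34, "gender": "M", "city": "NYC", "birthday": "01-05", "car": "red sports"},
    {"age": 34, "gender": "M", "city": "NYC", "birthday": "07-22", "car": "gray sedan"},
    {"age": 34, "gender": "M", "city": "NYC", "birthday": "01-05", "car": "blue SUV"},
    {"age": 29, "gender": "F", "city": "NYC", "birthday": "01-05", "car": "red sports"},
]

def matches(record, **known):
    """True if the record is consistent with everything the attacker knows."""
    return all(record[k] == v for k, v in known.items())

# Knowing only age/gender/city leaves several candidates...
broad = [r for r in records if matches(r, age=34, gender="M", city="NYC")]

# ...but adding two more attributes narrows it to exactly one record.
narrow = [r for r in records if matches(r, age=34, gender="M", city="NYC",
                                        birthday="01-05", car="red sports")]

print(len(broad), len(narrow))  # prints: 3 1
```

Scale that up to 15 attributes across a real population and, as the researchers found, almost everyone's combination of traits is unique.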
And using fifteen characteristics is actually pretty high for this sort of study. One MIT investigation of “anonymized” user credit card data found that users could be correctly “de-anonymized” 90 percent of the time using just four relatively vague points of information. Another study looking at vehicle data found that 15 minutes’ worth of brake pedal data alone was enough to pick the right driver, out of 15 options, 90% of the time.
The problem, of course, comes when multiple leaked data sets are released in the wild and can be cross referenced by attackers (state sponsored or otherwise), de-anonymized, then abused. The researchers in this new study were quick to proclaim how government and industry proclamations of “don’t worry, it’s anonymized!” are dangerous and inadequate:
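This kind of cross-referencing is usually called a linkage attack, and it needs nothing fancier than a join on shared attributes. Here is a minimal, hedged sketch (the datasets, names, and fields are all invented for illustration) of re-attaching identities to an “anonymized” release using a second dataset that includes names:

```python
# Sketch of a linkage attack: joining an "anonymized" release against
# a second dataset that shares quasi-identifiers. All data is invented.

anonymized = [  # names stripped, but a sensitive field retained
    {"zip": "10001", "age": 34, "gender": "M", "diagnosis": "flu"},
    {"zip": "10001", "age": 51, "gender": "F", "diagnosis": "asthma"},
]

public = [  # e.g. a voter roll or leaked account list that kept names
    {"name": "Alice Smith", "zip": "10001", "age": 51, "gender": "F"},
    {"name": "Bob Jones",   "zip": "10001", "age": 34, "gender": "M"},
]

def link(anon_rows, known_rows, keys):
    """Re-attach identities by joining on the shared quasi-identifiers."""
    out = []
    for a in anon_rows:
        hits = [k for k in known_rows if all(k[q] == a[q] for q in keys)]
        if len(hits) == 1:  # a unique match de-anonymizes the record
            out.append({"name": hits[0]["name"], "diagnosis": a["diagnosis"]})
    return out

linked = link(anonymized, public, ["zip", "age", "gender"])
print(linked)
# prints: [{'name': 'Bob Jones', 'diagnosis': 'flu'},
#          {'name': 'Alice Smith', 'diagnosis': 'asthma'}]
```

Neither dataset is identifying on its own; the damage comes entirely from the overlap.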
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” said senior author Dr Yves-Alexandre de Montjoye, from Imperial’s Department of Computing and Data Science Institute. “Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
It’s not clear how many studies like this we need before we stop using “anonymized” as some kind of magic word in privacy circles, but it’s apparently going to need to be a few dozen more.