There's No Such Thing As An Anonymized Dataset

from the statistical-analysis dept

Slashdot reports that a pair of computer scientists have figured out how to de-anonymize the "anonymous" data set that Netflix released as part of its million-dollar contest to improve its recommendation algorithm. The researchers found that the set of less-popular movies a user has rated tends to uniquely identify that user. By comparing movie ratings on IMDB with the ratings in the Netflix data set, the researchers were often able to uniquely pair a particular IMDB user with a corresponding Netflix user. And that meant the researchers would instantly have access to all of that user's Netflix ratings, which Netflix users presumably expected to remain private. While movie ratings might seem innocuous at first glance, the authors point out that one's movie ratings can often reveal potentially embarrassing personal details, including a user's views on politics, religion, and homosexuality.

This isn't the first time a company has released "anonymous" data regarding its users that turned out not to be so anonymous. Last year, AOL got in a lot of hot water when it released a data set of search queries that turned out to be quite easy to link back to the users conducting the searches.

The lesson here is that companies should be very reluctant to release private customer data, even if they believe they have "anonymized" it. Anonymization is surprisingly difficult, and you can never be sure you've done it successfully; it's always possible that someone will find a way to link records back to the people they represent. Wherever possible, companies needing to release data should either aggregate it in a way that avoids revealing information about individuals, or carefully limit who has access to the data sets so that they don't become publicly available. Simply stripping out the "username" field doesn't cut it.
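To make the linking idea concrete, here is a rough sketch of how matching on rarely rated movies could work. It is not the researchers' actual algorithm. The data, the user IDs, the movie titles, the `link_users` function, and its `tolerance` and `min_score` parameters are all hypothetical, and the toy data puts both sites on the same rating scale to keep things simple.

```python
from collections import defaultdict

# Hypothetical toy data: {user_id: {movie_title: rating}}.
# "netflix" stands in for the released data set with usernames stripped out;
# "imdb" stands in for ratings that users posted publicly under a name.
netflix = {
    "nf_001": {"Popular Hit": 4, "Lithuanian Comedy": 5,
               "Chinese Drama": 5, "Private Pick": 4},
    "nf_002": {"Popular Hit": 5, "Blockbuster Sequel": 3},
    "nf_003": {"Popular Hit": 3, "Chinese Drama": 2},
}
imdb = {
    "Johnny8332": {"Popular Hit": 4, "Lithuanian Comedy": 5, "Chinese Drama": 5},
    "casual_fan": {"Popular Hit": 5},
}


def movie_counts(dataset):
    """Count how many users in the data set rated each movie."""
    counts = defaultdict(int)
    for ratings in dataset.values():
        for movie in ratings:
            counts[movie] += 1
    return counts


def link_users(public, private, tolerance=1, min_score=1.0):
    """Pair each public user with at most one private record.

    A shared movie with a similar rating adds 1 / (number of raters),
    so agreement on an obscure film counts far more than agreement on
    a blockbuster. A pair is reported only if the best candidate clears
    min_score and scores at least twice as high as the runner-up.
    """
    counts = movie_counts(private)
    links = {}
    for pub_user, pub_ratings in public.items():
        scores = {}
        for priv_user, priv_ratings in private.items():
            score = 0.0
            for movie, rating in pub_ratings.items():
                if movie in priv_ratings and abs(priv_ratings[movie] - rating) <= tolerance:
                    score += 1.0 / counts[movie]
            scores[priv_user] = score
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_user, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_score >= min_score and best_score >= 2 * runner_up:
            links[pub_user] = best_user
    return links


links = link_users(imdb, netflix)
for pub_user, priv_user in links.items():
    # Ratings that were never posted publicly but are now tied to a name.
    exposed = set(netflix[priv_user]) - set(imdb[pub_user])
    print(pub_user, "matches", priv_user, "and exposes", sorted(exposed))
```

In this toy data the user who rated only the blockbuster cannot be singled out, while the user who also rated two obscure films can, which is the point: the rare ratings do the identifying, and every other rating in the matched record comes along for free.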

Filed Under: anonymity, data
Companies: netflix

Reader Comments

  1. Derek Kerton, 3 Dec 2007 @ 4:35pm

    It Is A Problem

    This actually is a privacy problem for Netflix and its users. The problem arises because what people choose to say and rate on IMDB (publicly) may not totally correspond to what they RENT or rate on Netflix (privately) -- yet the overlap can uniquely identify them.

    For example, say IMDB user Johnny8332 highly rates a Lithuanian comedy and a Chinese drama on both IMDB and Netflix. Let's assume Johnny is the only person who rated both films highly on both sites. Now we can link his Netflix behavior to his IMDB name.

    Next, Johnny doesn't want to tell anyone in the world that he's a closet homosexual and is extremely right wing (I don't get it, but it seems to happen, and Johnny's got every right to be a right-wing homosexual.) It's Johnny's prerogative to keep that personal info private.

    That's why when Johnny rents and enjoys "My Own Private Idaho", "FahrenHYPE 911", "Michael Moore Hates America", and some gay porn from Netflix, he also chooses not to go onto IMDB and rate them 8/10.

    But the researchers in the story have shown that they can match Johnny8332's public IMDB persona with Johnny's private persona and private choices. This is a risk to Johnny's privacy. It can now be made public through his IMDB ID that he's a right-wing homosexual.

    It's a lot like renting porn in a hotel, as many of you do, and then having the desk clerk attach a 4" round pin to your suit lapel at checkout that says "I watched porn last night". It's your choice to watch porn at the hotel, but it's probably not a detail you would choose to publicly disclose.

    Netflix has unwittingly allowed itself to expose people in that way.
