vin

November 30, 2007 at 3:37 pm

oh for shame. someone might be able to tell that someone relatively unique to me has rented a movie, if they put a ton of work into it and I have already signalled my lack of concern by essentially broadcasting this on imdb

Hallie

November 30, 2007 at 3:47 pm

Sorry, you’ve lost me on this one. If I’ve read this correctly, the data set was NOT “de-anonymized”, only a sub-set of it was. What percentage of the whole is it? I looked through their paper, but it doesn’t seem to be mentioned anywhere. And the only ones who can be identified are people who made a public profile on IMDb who presumably don’t mind being identified.

I also don’t agree with some of the conclusions to which the researchers jumped. For example, for one individual, they state “He did not like “Super Size Me” at all; perhaps this implies something about his physical size?” Or perhaps it implies that he thought the movie wasn’t well-made, or that the story was cliched? They also state “Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”.” If all they’re going on is a numerical rating without any written opinion about either movie, the only conclusion they could reasonably make would be what he thought about the movies as movies.

Sure seems to be a lot of noise without much substance.

Jim Harper

November 30, 2007 at 3:48 pm

Erm, great insight, vin.

Anyone in the TechDirt community – what is that, TechDirters? – know about synthesized data? I wrote about it some here, but don’t have a lot of knowledge.

Matt O (user link)

November 30, 2007 at 7:34 pm

the Sci-Fi Tempest in a teapot

While I agree that the theory is indeed pretty serious, the actual application here is pretty sketchy. You have to have both a Netflix account and an IMDB account in this scenario and use similar information in both of them.

What’s interesting about this to me is that this social hack is identical to breaking substitution cyphers – go with the most obvious data first (okay, so in code, you pick the Most frequent letters and combos and bang away on them for a while, and with Netflix you pick the Least Likely movies to repeat, but the *concept* is identical) and just bang away until you’re pretty sure you have a match.

I’d be interested to learn some of the math theory behind the matching algorithm even though I’m far from a mathematician.

Also, can’t I, as a netflix user just claim that the match is incorrect? I mean, if I’m given the chance to respond. I know that what we’re talking about here is reputation, so accusation is enough, but still – I agree that this is tempest in a teapot, at least a little bit.
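For anyone curious what "pick the Least Likely movies and bang away until you're pretty sure" might look like in practice, here is a minimal Python sketch. It is not the researchers' actual algorithm. It assumes a simplified scoring rule in which each movie is weighted by how few people rated it, and a match is only declared when the best candidate clearly outscores the runner-up. The function names, the `tolerance` and `min_gap` parameters, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def movie_weights(private_records):
    """Weight each movie by rarity: the fewer raters, the higher the weight."""
    counts = Counter(m for ratings in private_records.values() for m in ratings)
    # 1 / log(1 + count) is an assumed weighting, chosen only for illustration.
    return {m: 1.0 / math.log(1 + c) for m, c in counts.items()}


def score(public_ratings, private_ratings, weights, tolerance=1):
    """Sum the weights of movies rated similarly in both profiles."""
    total = 0.0
    for movie, rating in public_ratings.items():
        other = private_ratings.get(movie)
        if other is not None and abs(other - rating) <= tolerance:
            total += weights.get(movie, 0.0)
    return total


def best_match(public_ratings, private_records, min_gap=1.5):
    """Return the record id that stands out, or None if the match is ambiguous."""
    weights = movie_weights(private_records)
    ranked = sorted(
        ((score(public_ratings, r, weights), rid) for rid, r in private_records.items()),
        reverse=True,
    )
    if not ranked or ranked[0][0] <= 0:
        return None
    if len(ranked) == 1:
        return ranked[0][1]
    (top, top_id), (second, _) = ranked[0], ranked[1]
    # Only claim a match when the winner beats the runner-up by a clear margin.
    return top_id if top >= min_gap * second else None


# Hypothetical "anonymized" records keyed by an opaque id.
private = {
    "u1": {"Blockbuster": 4, "Lithuanian Comedy": 5, "Chinese Drama": 5},
    "u2": {"Blockbuster": 4, "Other Hit": 3},
    "u3": {"Blockbuster": 5, "Other Hit": 3},
}
# Hypothetical public profile with a few ratings, one of them slightly off.
public = {"Blockbuster": 4, "Lithuanian Comedy": 5, "Chinese Drama": 4}

print(best_match(public, private))
print(best_match({"Blockbuster": 4}, private))
```

The popular title alone cannot single anyone out, because every record shares it. The two obscure titles are what separate one record from the rest, which is the same reasoning as starting with the most telling letters in a cipher.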

Ferin

December 3, 2007 at 5:15 am

Not really de anonymized...

From what I’d read on the slashdot posting, all the researchers had done was find a subset of imdb users who’d made reviews on both imdb and netflix with similar timing and content. While I’m sure you could positively ID a few of these people from this, it’s not a very strong ID, and at best all you’ve done is link an imdb screen name up with some data from netflix. Doesn’t seem particularly chilling to me.

Am I getting this one wrong?

Derek Kerton (profile)

December 3, 2007 at 4:35 pm

It Is A Problem

This actually is a privacy problem for Netflix and its users. The problem is that what people choose to say and rate on IMDB (publicly) may not totally correspond to what they RENT or rate on Netflix (privately), yet the overlap can uniquely identify them.

For example, say IMDB user Johnny8332 rents and highly rates a Lithuanian comedy and a Chinese drama film on both IMDB and Netflix. Let’s assume Johnny is the only guy who saw and rated both films highly at both sites. Now we can link his Netflix behavior to his IMDB name.

Next, Johnny doesn’t want to tell anyone in the world that he’s a closet homosexual and is extremely right wing (I don’t get it, but it seems to happen, and Johnny’s got every right to be a right-wing homosexual.) It’s Johnny’s prerogative to keep that personal info private.

That’s why when Johnny rents and enjoys “My Own Private Idaho”, “FahrenHYPE 911”, “Michael Moore Hates America”, and some gay porn from Netflix, he also chooses not to go onto IMDB and rate them 8/10.

But the researcher in the story has shown that he can identify and match Johnny8332’s public IMDB persona with Johnny’s private persona and private choices. This is a risk to Johnny’s privacy. It can now be made public through his IMDB ID that he’s a right wing homosexual.

It’s a lot like many of you who rent the porn in hotels having the desk clerk, as you check out, attach a 4″ round pin to your suit lapel that says “I watched porn last night”. It’s your choice to watch porn at the hotel, but it’s probably not a detail you would choose to publicly disclose.

Netflix has unwittingly allowed itself to expose people in that way.

Celes

December 3, 2007 at 6:45 pm

Re: It Is A Problem

As for the hotel bit, I don’t know about all video providers for hotels, but the provider my hotel uses does not disclose the title or type of movie watched, only the price (which tends, of course, to be higher for porn but this is not true in all cases), so it’s impossible for the desk clerk to know what you were watching. Caveat: If you purchase the all-day porn package, it costs way more than anything else offered, so if your clerk is one of the few who actually pays attention to that sort of thing, yeah, they’ll know.

Michael

June 5, 2008 at 8:45 pm

Re: It Is A Problem

You might want to learn your left from your right.

Jeremy (user link)

May 12, 2010 at 12:12 pm

Census data is supposed to be "anonymized"

Just thought I’d throw that into the mix.

G-Minor

May 4, 2011 at 3:34 am

The BIG PICTURE…whether it’s Netflix, Google / Android, RIM, Sony, Apple WE ARE BEING watched and tracked. These companies should be totally honest and up front with what they do with our personal data. NOT GIVING / SELLING it for a profit without notification or permission. How about notifying that individual first and asking them if they would like to make a profit by agreeing to their info being sold. This way if personal data gets released, he or she can sue that company and or get 95% of the profits from every company that has his / her information at their ready to display or target. I’d sue for “defamation of character” and stolen identity.

There's No Such Thing As An Anonymized Dataset

from the statistical-analysis dept

Comments on “There's No Such Thing As An Anonymized Dataset”
