Netflix $1 Million Award Shows The Value Of Collaboration… But Kicks Up New Privacy Questions

from the good...-and-bad dept

Back in July, we wrote about how the Netflix $1 million prize showed how much further research efforts could get by collaborating, rather than hoarding. Now that the official prize has been awarded, we’re hearing even more about that point:

The blending of different statistical and machine-learning techniques “only works well if you combine models that approach the problem differently,” said Chris Volinsky, a scientist at AT&T Research and a leader of the Bellkor team [which won]. “That’s why collaboration has been so effective, because different people approach problems differently.”
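Volinsky's point about blending can be seen in a toy example. The sketch below is only an illustration: the ratings and the two "models" are made-up numbers, and a plain average stands in for whatever blending method the teams actually used. It scores predictions by root-mean-square error (RMSE) and shows that averaging helps when the models err in different places, and does nothing when they err identically.

```python
import math

def rmse(predictions, truth):
    """Root-mean-square error between predicted and true ratings."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(truth))

def blend(a, b):
    """Simple average of two models' predictions (an assumed, minimal blend)."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Hypothetical data: four true ratings and two models that are each off by 0.5,
# but in different directions on the first two items.
truth = [3.0, 4.0, 2.0, 5.0]
model_a = [3.5, 3.5, 2.5, 4.5]
model_b = [2.5, 4.5, 2.5, 4.5]

print(rmse(model_a, truth))                   # each model alone
print(rmse(model_b, truth))
print(rmse(blend(model_a, model_b), truth))   # lower: differing errors cancel
print(rmse(blend(model_a, model_a), truth))   # unchanged: identical errors do not
```

Where the two models disagree, their errors cancel in the average; where they make the same mistake, the blend inherits it.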

Indeed. There’s plenty of research out there showing the leaps that are made in innovation when people with different approaches collaborate. Yet, with so much of a focus on “patents” representing “innovation,” the opposite occurs. The patent system is all about hoarding information and making it harder to collaborate by putting tollbooths in the process. Many of the final “teams” involved a whole bunch of different approaches. Imagine if each one had a patent on their method. Think of how expensive that kind of innovation would be. Then, realize that there are plenty of technologies that face that exact problem today.

In the meantime, Paul Ohm is raising some serious questions about people’s privacy on the new Netflix Prizes that are being announced. While Netflix claims that the data is anonymized, we’ve seen before that anonymous datasets are almost never anonymous, and in Netflix’s case, the details are pretty bad:

Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:

The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.

Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney’s famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
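The "simple arithmetic" Ohm mentions can be sketched roughly as follows. Every number here is an assumption for illustration (a ZIP code of 32,000 residents, two gender values, 80 distinct ages), and the calculation assumes people are spread evenly across all combinations, which real populations are not.

```python
import math

def expected_group_size(zip_population, n_genders=2, n_age_values=80, days_per_year=1):
    """Average number of people sharing one (ZIP, gender, age) combination,
    assuming an even spread across every combination."""
    combinations = n_genders * n_age_values * days_per_year
    return zip_population / combinations

def bits_revealed(n_genders=2, n_age_values=80, days_per_year=1):
    """Bits of identifying information carried by gender plus age (or birthdate)."""
    return math.log2(n_genders * n_age_values * days_per_year)

zip_population = 32_000  # hypothetical ZIP code size

# Gender + ZIP + age: a couple of hundred people per combination.
print(expected_group_size(zip_population))                     # 200.0

# Gender + ZIP + full birthdate: fewer than one person per combination,
# which is why that triple is so often unique.
print(expected_group_size(zip_population, days_per_year=365))  # roughly 0.55

print(bits_revealed())                   # bits from gender + age
print(bits_revealed(days_per_year=365))  # bits from gender + birthdate
```

Under these assumptions, swapping birthdate for age moves a renter from "probably unique" to "one of a couple of hundred", which matches the range Ohm describes.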

Ohm also points out that this prize almost certainly violates the law:

Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710 prohibits a “video tape service provider” (a broadly defined term) from revealing “personally identifiable information” about its customers. Aggrieved customers can sue providers under the VPPA and courts can order “not less than $2500” in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

Additionally, the FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.

It seems rather surprising that Netflix’s lawyers did not consider this.

Companies: netflix


Comments on “Netflix $1 Million Award Shows The Value Of Collaboration… But Kicks Up New Privacy Questions”

17 Comments
Anonymous Coward says:

Yes, but the defense is that the information was not revealed to the public. The information, if indeed it would have violated the act, was revealed only to employees and contractors.

If Blockbuster wanted to do a study on its renters, it would hire an outside firm to crunch all the numbers. It would have to provide the information to the firm. That would not violate the law. However, that firm, like those who received the information from Netflix, would be under the same legal obligations as Blockbuster/Netflix.

And zipcode+age+gender is not revealing of personally identifiable information. It is revealing of groups of renters. Now zip+4 may be a little more specific…

Derek Kerton (profile) says:

Re: Re: Re:3 Re:

Well, the big problem with Zipcode + gender + birthdate is that it uniquely identifies 87% of people. Uniquely identifying anyone is a problem.

But how bad is it if the data can be tracked back to “well, it’s from one of these five people”? That’s obviously nowhere near as bad. In fact, IMHO, there is a quantum leap in difference of privacy breach between uniquely identified, and ANY kind of uncertainty.

But, as Brooks is thinking, a 1/1 match is very bad, a 1/2 match is bad. How much better is a 1/5, etc? At which point is it actually anonymous data?

Anyway, I would argue that by using age instead of birthdate, they have reduced the likelihood of an exact match by a factor of 365 (I know, the precise stats calculations are much more complicated.) That goes a long way to protecting privacy. I think that the suggestion above, that they make it 5-year ranges, would be most acceptable, and would not significantly reduce the predictive value of the movie recommendation solutions.
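The 5-year-range idea can be checked on a dataset by counting how many records share each (ZIP, gender, age) combination before and after bucketing. This is a minimal sketch with invented records and hypothetical helper names, not anything Netflix is known to do; the smallest group size is the number that matters, since a group of one is a uniquely identified renter.

```python
from collections import Counter

def age_band(age, width=5):
    """Lower bound of the age band an age falls into, e.g. 33 -> 30."""
    return (age // width) * width

def smallest_group(records, generalize=None):
    """Size of the smallest set of records sharing (ZIP, gender, age).
    `generalize` optionally coarsens the age before grouping."""
    counts = Counter(
        (zip_code, gender, generalize(age) if generalize else age)
        for zip_code, gender, age in records
    )
    return min(counts.values())

# Hypothetical renters: (ZIP code, gender, age)
records = [
    ("12345", "F", 31),
    ("12345", "F", 33),
    ("12345", "F", 34),
    ("12345", "M", 40),
    ("12345", "M", 44),
]

print(smallest_group(records))             # 1: exact ages single everyone out
print(smallest_group(records, age_band))   # 2: 5-year bands merge neighbours
```

A release could be gated on this number staying above some chosen threshold, though where that threshold should sit is exactly the open question in this thread.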

Brooks (profile) says:

So I was totally with this article, and that 87% figure grabbed me — and then we go on to say that, actually, it’s a totally meaningless figure because it’s based on information that *won’t* be released.

Maybe there are privacy concerns. But that’s a huge red herring that distorts the issue a lot. I mean, if they released credit card numbers and names, it would be a huge issue (but they aren’t). Why include a stat and then say it’s not relevant?

Me, I’m a lot less bothered by something that “could” be used to reduce anonymity to “a few hundred individuals in some zip codes.” Maybe there’s an interesting conversation about at what point personally identifiable becomes non-personally identifiable. But my instinct is that this doesn’t cross the line, once you parse the article for what’s actually happening and not how it would be if different things were happening.

Big Al says:

Technically, as soon as the selection criteria are broad enough to encompass two individuals, the information is not personally identifiable in that there is still an element of doubt as to which of the individuals is being referred to.
However, I don’t think that will wash with privacy advocates or, come to think of it, the RIAA’s legal team.

scarr (profile) says:

Re: Re:

That isn’t true.

First off, you have solid information for anything both people rented.

Second, it might not take a large stretch to figure out which of the two is which based on trends. (For example, someone’s profession or hobby might make it easy to pick out the one renting all the music documentaries.)

Lastly, it could still leave a situation where you know someone rented either A or B, where both A and B might both be selections the person wouldn’t want publicized.

The ease of identification increases the more of an outlier you are in your community. While a student on a college campus would be very hard to identify, that same student in a small suburban neighbourhood could be uniquely identified without any other info.

(Btw, I have no idea where the RIAA would factor into this.)

Anonymous Coward says:

I’m not sure this is really a big deal for privacy.
First, I would like to know if anybody knows of something bad that could potentially happen to those who have been identified. Would any employer fire someone who rented “Beverly Hills Chihuahua (2008)”, or would so many people make fun of him that he suffers mental distress?

Seriously, what are the bad things that could happen in this case?

Anonymous Coward says:

1. This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.

2. What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.

Mike Masnick (profile) says:

Re: Re:

This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.

Point missed, huh? The point is that patents make collaboration like this harder, not easier.

What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.

Heh. Have you ever taken statistics? This is a classic statistics error. Just because most collaborations aren’t fruitful doesn’t mean collaboration isn’t fruitful. In fact, just the opposite — it means you should want to enable EVEN MORE collaboration to allow the good ones to get through.

Mike Orr says:

Data CAN be further obfuscated - not 100% solution, but still

Netflix can/should maybe use forms of obfuscation to increase the “anonymity factor”. For example, while it makes sense that having a common zipcode is an important attribute, it (probably) does not matter WHICH zipcode it is, so all zipcodes can be scrambled in some uniform manner.
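One way to scramble zipcodes "in some uniform manner" is a keyed hash: the same zipcode always maps to the same token, so "these renters share a zipcode" survives, but the token cannot be turned back into a zipcode without the key. The sketch below is an assumption about how that might look, using Python's standard `hmac` module; the key and the function name are hypothetical.

```python
import hashlib
import hmac

def scramble_zip(zip_code, secret_key):
    """Map a zipcode to a stable token using HMAC-SHA256.
    The same (zipcode, key) pair always yields the same token."""
    digest = hmac.new(secret_key, zip_code.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability

# Hypothetical key; it would have to stay private, because zipcodes are few
# enough that an unkeyed hash could be reversed by trying them all.
key = b"example-secret-not-for-real-use"

a = scramble_zip("94110", key)
b = scramble_zip("94110", key)
c = scramble_zip("10001", key)

print(a == b)  # True: renters in the same zipcode still group together
print(a == c)  # False: different zipcodes stay distinguishable
```

As the title says, this is not a 100% solution: the number of renters behind each token is still visible, and an unusually large or small group may hint at which zipcode it is.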
