Netflix $1 Million Award Shows The Value Of Collaboration... But Kicks Up New Privacy Questions

from the good...-and-bad dept

Back in July, we wrote about how the Netflix $1 million prize showed how much further research efforts could get by collaborating, rather than hoarding. Now that the official prize has been awarded, we're hearing even more about that point:
The blending of different statistical and machine-learning techniques "only works well if you combine models that approach the problem differently," said Chris Volinsky, a scientist at AT&T Research and a leader of the Bellkor team [which won]. "That's why collaboration has been so effective, because different people approach problems differently."
Indeed. There's plenty of research out there showing the leaps that are made in innovation when people with different approaches collaborate. Yet, with so much of a focus on "patents" representing "innovation," the opposite occurs. The patent system is all about hoarding information and making it harder to collaborate by putting tollbooths in the process. Many of the final "teams" involved a whole bunch of different approaches. Imagine if each one had a patent on their method. Think of how expensive that kind of innovation would be. Then, realize that there are plenty of technologies that face that exact problem today.
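Volinsky's point about blending can be illustrated with a toy example. This is only a sketch: the ratings and both "models" below are made-up numbers, and plain averaging stands in for whatever blending the teams actually used. It scores predictions by root mean squared error (RMSE), the measure the contest used.

```python
import math

def rmse(predictions, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actual)]
    return math.sqrt(sum(squared) / len(actual))

# Hypothetical ratings and two hypothetical models (illustrative numbers only).
actual  = [4.0, 3.0, 5.0, 2.0, 4.0, 1.0]
model_a = [4.5, 2.5, 5.5, 1.5, 4.5, 0.5]  # always off by 0.5, in one pattern
model_b = [3.5, 3.5, 5.5, 1.5, 3.5, 1.5]  # also off by 0.5, in a different pattern

# Simple average of the two models' predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

# Blending a model with a copy of itself changes nothing.
self_blend = [(a + a) / 2 for a in model_a]

print(rmse(model_a, actual))     # 0.5
print(rmse(model_b, actual))     # 0.5
print(rmse(blend, actual))       # about 0.29
print(rmse(self_blend, actual))  # 0.5
```

Where the two models err in opposite directions the errors cancel, so the blend beats both. Averaging a model with a copy of itself gains nothing, which is the "approach the problem differently" part.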

In the meantime, Paul Ohm is raising some serious questions about people's privacy in the new Netflix Prize that is being announced. While Netflix claims that the data is anonymized, we've seen before that anonymous datasets are almost never anonymous, and in Netflix's case, the details are pretty bad:
Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals' "taste profiles," the company said. The data set of more than 100 million entries will include information about renters' ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney's famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
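The "simple arithmetic" Ohm mentions might look roughly like this. It is a back-of-the-envelope sketch, not his calculation: the ZIP code population of 30,000, the 80 distinct ages, and the assumption that people are spread evenly across ages, genders and birthdays are all hypothetical, and real populations are far lumpier.

```python
def expected_group_size(zip_population, distinct_ages=80, genders=2, days_per_year=1):
    """Average number of people sharing one (ZIP, gender, age[, birthday]) combination.

    Assumes a perfectly uniform population, which is a crude simplification.
    Pass days_per_year=365 to model a full birthdate instead of an age.
    """
    return zip_population / (genders * distinct_ages * days_per_year)

# Hypothetical ZIP code with 30,000 residents.
by_age = expected_group_size(30_000)
by_birthdate = expected_group_size(30_000, days_per_year=365)

print(by_age)        # 187.5 people per (ZIP, gender, age) group
print(by_birthdate)  # about 0.51, i.e. most combinations point to one person or nobody
```

Under those assumptions, age leaves you in a crowd of a couple hundred, while a full birthdate leaves most people alone in their group. Smaller ZIP codes and unusual ages shrink the crowd quickly.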
Ohm also points out that this prize may well violate the law:
Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710 prohibits a "video tape service provider" (a broadly defined term) from revealing "personally identifiable information" about its customers. Aggrieved customers can sue providers under the VPPA and courts can order "not less than $2500" in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

Additionally, the FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.
It seems rather surprising that Netflix's lawyers did not consider this.


Reader Comments

  1.  
    Anonymous Coward, Sep 22nd, 2009 @ 4:28pm

    Yes, but the defense is that the information was not revealed to the public. The information, if indeed it would have violated the act, was revealed only to employees and contractors.

    If Blockbuster wanted to do a study on its renters, it would hire an outside firm to crunch all the numbers. It would have to provide the information to the firm. That would not violate the law. However, that firm, like those who received the information from Netflix, would be under the same legal obligations as Blockbuster/Netflix.

    And zipcode+age+gender does not reveal personally identifiable information. It reveals groups of renters. Now zip+4 may be a little more specific...

     


  2.  
    Brooks (profile), Sep 22nd, 2009 @ 5:33pm

    So I was totally with this article, and that 87% figure grabbed me -- and then we go on to say that, actually, it's a totally meaningless figure because it's based on information that *won't* be released.

    Maybe there are privacy concerns. But that's a huge red herring that distorts the issue a lot. I mean, if they released credit card numbers and names, it would be a huge issue (but they aren't). Why include a stat and then say it's not relevant?

    Me, I'm a lot less bothered by something that "could" be used to reduce anonymity to "a few hundred individuals in some zip codes." Maybe there's an interesting conversation about at what point personally identifiable becomes non-personally identifiable. But my instinct is that this doesn't cross the line, once you parse the article for what's actually happening and not how it would be if different things were happening.

     


  3.  
    Michael Kirkland, Sep 22nd, 2009 @ 5:48pm

    Re:

    Zipcode+age+gender is enough to personally identify ~90% of Americans, so yes.

     


  4.  
    Brooks (profile), Sep 22nd, 2009 @ 5:54pm

    Re: Re:

    No. Zipcode + gender + birthdate identifies 87% of Americans. Age != Birthdate.

     


  5.  
    Anonymous Coward, Sep 22nd, 2009 @ 6:07pm

    Re: Re: Re:

    I wonder if changing it from age to age group (5-10 year ranges) would be enough to satiate those with privacy concerns... Could probably work out something similar with zip codes

     


  6.  
    Big Al, Sep 22nd, 2009 @ 6:33pm

    Technically, as soon as the selection criteria are broad enough to encompass two individuals, the information is not personally identifiable in that there is still an element of doubt as to which of the individuals is being referred to.
    However, I don't think that will wash with privacy advocates or, come to think of it, the RIAA's legal team.

     


  7.  
    Brooks (profile), Sep 22nd, 2009 @ 6:34pm

    Re: Re: Re: Re:

    Yep, but I guess the question is at what point it becomes "anonymous", and that's going to be a matter of opinion. Is it anonymous if I can say it's one out of these 100 people? 1 out of 1,000? 100,000?

     


  8.  
    Haelian, Sep 22nd, 2009 @ 7:42pm

    Re: Yes, but the defense is that the information was not revealed to the public.

    That's not true. The information was available to anyone who wanted to take part in the contest and was easily downloaded via the Netflix website.

     


  9.  
    Anonymous Coward, Sep 22nd, 2009 @ 7:53pm

    I'm not sure this is really a big deal for privacy.
    First I would like to know if anybody knows about something bad that could potentially happen to those that have been identified. Would any employer fire someone who rented "Beverly Hills Chihuahua (2008)", or would so many people make fun of him that he suffers mental distress?

    Seriously, what are the bad things that could happen in this case?

     


  10.  
    Anonymous Coward, Sep 22nd, 2009 @ 8:46pm

    1. This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.

    2. What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.

     


  11.  
    scarr (profile), Sep 22nd, 2009 @ 10:00pm

    Re:

    That isn't true.

    First off, you have solid information for anything both people rented.

    Second, it might not take a large stretch to figure out which of the two is which based on trends. (For example, someone's profession or hobby might make it easy to pick out the one renting all the music documentaries.)

    Lastly, it could still leave a situation where you know someone rented either A or B, where both A and B might both be selections the person wouldn't want publicized.

    The ease of identification increases the more of an outlier you are in your community. While a student on a college campus would be very hard to identify, that same student in a small suburban neighbourhood could be uniquely identified without any other info.

    (Btw, I have no idea where the RIAA would factor into this.)

     


  12.  
    Mike Masnick (profile), Sep 22nd, 2009 @ 11:11pm

    Re:

    This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.

    Point missed, huh? The point is that patents make collaboration like this harder, not easier.

    What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.

    Heh. Have you ever taken statistics? This is a classic statistics error. Just because most collaborations aren't fruitful doesn't mean collaboration isn't fruitful. In fact, just the opposite -- it means you should want to enable EVEN MORE collaboration to allow the good ones to get through.

     


  13.  
    Another AC, Sep 23rd, 2009 @ 6:01am

    Opt Out?

    They should allow you to opt out of these studies. Alternatively, since all they ask for is your birth year, change that by a year or two and they would never find you.

     


  14.  
    Anonymous Coward, Sep 23rd, 2009 @ 8:26am

    Re: Re:

    Since I saw no mention in the article about patents, bringing them up does seem to be a gratuitous reference. Had patents posed a problem I would have expected at least some mention, and yet the article contains nary a word.

     


  15.  
    Mike Orr, Sep 23rd, 2009 @ 9:50am

    Data CAN be further obfuscated - not 100% solution, but still

    Netflix can, and maybe should, use forms of obfuscation to increase the "anonymity factor". For example, while it makes sense that having a common zipcode is an important attribute, it (probably) does not matter WHICH zipcode it is, so all zipcodes can be scrambled in some uniform manner.
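One way the "scramble uniformly" idea could be sketched is with a keyed hash, so that the same zipcode always maps to the same opaque token. Nothing here is something Netflix has said it does: the function name `pseudonymize_zip` and the key are hypothetical. A key matters because there are only tens of thousands of zipcodes, so a plain unkeyed hash could be reversed by simply hashing all of them.

```python
import hashlib
import hmac

def pseudonymize_zip(zip_code: str, secret_key: bytes) -> str:
    """Map a zipcode to a stable opaque token using HMAC-SHA256.

    The same zipcode and key always yield the same token, so records that
    share a zipcode still share a token after scrambling.
    """
    digest = hmac.new(secret_key, zip_code.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened for readability; truncation raises collision odds

# Hypothetical key that the data owner would keep private.
key = b"hypothetical-secret-kept-by-the-data-owner"

token_a = pseudonymize_zip("94110", key)
token_b = pseudonymize_zip("94110", key)
token_c = pseudonymize_zip("10001", key)

print(token_a == token_b)  # True: same zipcode, same token
print(token_a == token_c)  # False for these two zipcodes
```

Note what this does not fix: it hides WHICH zipcode a group lives in, but the groups are exactly as small as before, and anyone who can guess a zipcode from its size or its renting habits is back where they started.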

     


  16.  
    Derek Kerton (profile), Sep 24th, 2009 @ 12:00am

    Re: Re: Re: Re: Re:

    Well, the big problem with Zipcode + gender + birthdate is that it uniquely identifies 87% of people. Uniquely identifying anyone is a problem.

    But how bad is it if the data can be tracked back to "well, it's from one of these five people"? That's obviously nowhere near as bad. In fact, IMHO, there is a quantum leap in difference of privacy breach between uniquely identified, and ANY kind of uncertainty.

    But, as Brooks is thinking, a 1/1 match is very bad, a 1/2 match is bad. How much better is a 1/5, etc? At which point is it actually anonymous data?

    Anyway, I would argue that by using age instead of birthdate, they have reduced the likelihood of an exact match by a factor of 365 (I know, the precise stats calculations are much more complicated.) That goes a long way to protecting privacy. I think that the suggestion above, that they make it 5-year ranges, would be most acceptable, and would not significantly reduce the predictive value of the movie recommendation solutions.
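The "1/1 versus 1/5" question can be made concrete by counting how many records share each (zipcode, gender, age) combination and looking at the smallest group, which is the usual k-anonymity measure. The sketch below uses six made-up records and hypothetical helper names (`age_bucket`, `smallest_group`); it shows how switching to 5-year ranges changes that smallest group.

```python
from collections import Counter

def age_bucket(age, width=5):
    """Generalize an exact age into a range label such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_group(records, width=None):
    """Size of the smallest group sharing (zipcode, gender, age or age range).

    A result of 1 means at least one record is unique on those fields.
    """
    groups = Counter(
        (zip_code, gender, age if width is None else age_bucket(age, width))
        for zip_code, gender, age in records
    )
    return min(groups.values())

# Hypothetical records: (zipcode, gender, age).
records = [
    ("94110", "F", 31), ("94110", "F", 33), ("94110", "F", 34),
    ("94110", "M", 40), ("94110", "M", 42), ("94110", "M", 44),
]

print(smallest_group(records))           # 1: every exact age is unique here
print(smallest_group(records, width=5))  # 3: each record hides among three
```

Where to set the minimum acceptable group size is still the judgment call discussed above; the code only measures it.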

     


  17.  
    Griff (profile), Sep 25th, 2009 @ 5:21am

    Re: Re:

    Just because most collaborations aren't fruitful doesn't mean collaboration isn't fruitful.

    Yes, Mike, but conversely, just because the winners happened to collaborate does not prove the collaboration helped them win.

     


