Judge Orders OpenAI To Give Lawyers 20 Million Private Chats, Thinks ‘Anonymization’ Can Keep Them Private

from the seems-like-a-problem dept

A federal magistrate judge just ordered that the private ChatGPT conversations of 20 million users be handed over to the lawyers for dozens of plaintiffs, including news organizations. Those 20 million people weren’t asked. They weren’t notified. They have no say in the matter.

Last week, Magistrate Judge Ona Wang ordered OpenAI to turn over a sample of 20 million chat logs as part of the sprawling multidistrict litigation where publishers are suing AI companies—a mess of consolidated cases that kicked off with the NY Times’ lawsuit against OpenAI. Judge Wang dismissed OpenAI’s privacy concerns, apparently convinced that “anonymization” solves everything.

Even if you hate OpenAI and everything it stands for, and hope that the news orgs bring it to its knees, this should scare you. A lot. OpenAI had pointed out to the judge a week earlier that these demands from the news orgs would represent a massive privacy violation for ChatGPT’s users.

News Plaintiffs demand that OpenAI hand over the entire 20M log sample “in readily searchable format” via a “hard drive or [] dedicated private cloud.” ECF 656 at 3. That would include logs that are neither relevant nor responsive—indeed, News Plaintiffs concede that at least 99.99% of the logs are irrelevant to their claims. OpenAI has never agreed to such a process, which is wildly disproportionate to the needs of the case and exposes private user chats for no reasonable litigation purpose. In a display of striking hypocrisy, News Plaintiffs disregard those users’ privacy interests while claiming that their own chat logs are immune from production because “it is possible” that their employees “entered sensitive information into their prompts.” ECF 475 at 4. Unlike News Plaintiffs, OpenAI’s users have no stake in this case and no opportunity to defend their information from disclosure. It makes no sense to order OpenAI to hand over millions of irrelevant and private conversation logs belonging to those absent third parties while allowing News Plaintiffs to shield their own logs from disclosure.

OpenAI offered a much more privacy-protective alternative: hand over only a targeted set of logs actually relevant to the case, rather than dumping 20 million records wholesale. The news orgs fought back, but their reply brief is sealed—so we don’t get to see their argument. The judge bought it anyway, dismissing the privacy concerns on the theory that OpenAI can simply “anonymize” the chat logs:

Whether or not the parties had reached agreement to produce the 20 million Consumer ChatGPT Logs in whole—which the parties vehemently dispute—such production here is appropriate. OpenAI has failed to explain how its consumers’ privacy rights are not adequately protected by: (1) the existing protective order in this multidistrict litigation or (2) OpenAI’s exhaustive de-identification of all of the 20 million Consumer ChatGPT Logs.

The judge then quotes the news orgs’ filing, noting that OpenAI has already put in this effort to “deidentify” the chat logs.

Both of those supposed protections—the protective order and “exhaustive de-identification”—are nonsense. Let’s start with the anonymization problem, because it shows a stunning lack of understanding about what it means to anonymize data sets, especially AI chatlogs.

We’ve spent years warning people that “anonymized data” is a gibberish term, used by companies to pretend large collections of data can be kept private, when that’s just not true. Almost any large dataset of “anonymized” data can have significant portions of the data connected back to individuals with just a little work. Researchers re-identified individuals from “anonymized” AOL search queries, from NYC taxi records, from Netflix viewing histories—the list goes on. Every time someone shows up with an “anonymized” dataset, researchers show ways to re-identify people in the dataset.
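The basic mechanics of those re-identification cases are simple enough to sketch. The snippet below is a minimal, hypothetical illustration (the data and names are invented, loosely modeled on the NYC taxi example): stripping direct identifiers leaves behind quasi-identifiers, and joining those against outside information recovers identities.

```python
# Hypothetical linkage-attack sketch: "anonymized" records keep
# quasi-identifiers (here, a pickup zone and an hour of day) that can be
# joined against outside information to recover who is who.
anonymized_rides = [
    {"ride_id": "r1", "pickup_zone": "Midtown", "hour": 9},
    {"ride_id": "r2", "pickup_zone": "SoHo",    "hour": 23},
]

# Side information an attacker might have, e.g. a tabloid photo of a
# celebrity getting into a cab at a known place and time.
sightings = [
    {"name": "Alice Example", "zone": "SoHo", "hour": 23},
]

def reidentify(rides, sightings):
    """Match stripped records back to people via shared quasi-identifiers."""
    matches = []
    for ride in rides:
        for s in sightings:
            if ride["pickup_zone"] == s["zone"] and ride["hour"] == s["hour"]:
                matches.append((s["name"], ride["ride_id"]))
    return matches

print(reidentify(anonymized_rides, sightings))  # [('Alice Example', 'r2')]
```

With two quasi-identifier columns and a toy dataset the match is trivial; with millions of records and richer side channels, the same join just takes more compute, not more cleverness.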

And that’s even worse when it comes to ChatGPT chat logs, which are likely to be way more revealing than the previous data sets where the inability to anonymize data was called out. There have been plenty of reports of just how much people “overshare” with ChatGPT, often including incredibly private information.

Back in August, researchers got their hands on just 1,000 leaked ChatGPT conversations and talked about how much sensitive information they were able to glean from just that small number of chats.

Researchers downloaded and analyzed 1,000 of the leaked conversations, spanning over 43 million words. Among them, they discovered multiple chats that explicitly mentioned personally identifiable information (PII), such as full names, addresses, and ID numbers.

With that level of PII and sensitive information, connecting chats back to individuals is likely way easier than in previous cases of connecting “anonymized” data back to individuals.

And that was with just 1,000 records.

Then, yesterday as I was writing this, the Washington Post revealed that they had combed through 47,000 ChatGPT chat logs, many of which were “accidentally” revealed via ChatGPT’s “share” feature. Many of them reveal deeply personal and intimate information.

Users often shared highly personal information with ChatGPT in the conversations analyzed by The Post, including details generally not typed into conventional search engines.

People sent ChatGPT more than 550 unique email addresses and 76 phone numbers in the conversations. Some are public, but others appear to be private, like those one user shared for administrators at a religious school in Minnesota.

Users asking the chatbot to draft letters or lawsuits on workplace or family disputes sent the chatbot detailed private information about the incidents.

There are examples where, even if the user’s official details are redacted, it would be trivial to figure out who was actually doing the chats:

If you can’t see that, it’s a chat with ChatGPT, redacted by the Washington Post, saying:

User
my name is [name redacted] my husband name [name redacted] is threatning me to kill and not taking my responsibities and trying to go abroad […] he is not caring us and he is going to kuwait and he will give me divorce from abroad please i want to complaint to higher authgorities and immigrition office to stop him to go abroad and i want justice please help


ChatGPT
Below is a formal draft complaint you can submit to the Deputy Commissioner of Police in [redacted] addressing your concerns and seeking immediate action:

That seems like even if you “anonymized” the chat by taking off the user account details, it wouldn’t take long to figure out whose chat it was, revealing some pretty personal info, including the names of their children (according to the Post).

And WaPo reporters found that by starting with 93,000 chats, then using tools to do an analysis of the 47,000 in English, followed by human review of just 500 chats in a “random sample.”

Now imagine 20 million records. With many, many times more data, the ability to cross-reference information across chats, identify patterns, and connect seemingly disconnected pieces of information becomes exponentially easier. This isn’t just “more of the same”—it’s a qualitatively different threat level.

Even worse, the judge’s order contains a fundamental contradiction: she demands that OpenAI share these chatlogs “in whole” while simultaneously insisting they undergo “exhaustive de-identification.” Those two requirements are incompatible.

Real de-identification would require stripping far more than just usernames and account info—it would mean redacting or altering the actual content of the chats, because that content is often what makes re-identification possible. But if you’re redacting content to protect privacy, you’re no longer handing over the logs “in whole.” You can’t have both. The judge doesn’t grapple with this contradiction at all.
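To make the contradiction concrete, here is a minimal sketch (invented chat text, illustrative regexes only, not any real de-identification pipeline) of what stripping only the obvious structured identifiers looks like. The patterns catch an email address and a phone number, but the free-text details that actually identify the speaker pass straight through.

```python
import re

# Invented example chat text, not a real log.
chat = ("my name is Jane Doe, my husband Rahul is going to Kuwait; "
        "our children attend a religious school in Minnesota, contact "
        "admin@example.org or 555-0142")

# Naive "de-identification": scrub emails and phone numbers only.
scrubbed = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", chat)
scrubbed = re.sub(r"\b\d{3}-\d{4}\b", "[PHONE]", scrubbed)

print(scrubbed)
# The names, the school, and the travel plans all survive. Removing them
# too would mean rewriting the content itself, at which point the logs
# are no longer being produced "in whole."
```

The point isn’t that better regexes exist; it’s that anything strong enough to remove identifying free-text detail necessarily alters the very content the production order says must be handed over intact.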

Yes, as the judge notes, this data is kept under the protective order in the case, meaning that it shouldn’t be disclosed. But protective orders are only as strong as the people bound by them, and there’s a huge risk here.

Looking at the docket, there are a ton of lawyers who will have access to these files. The docket list of parties and lawyers is 45 pages long if you try to print it out. While there are plenty of repeats in there, there have to be at least 100 lawyers and possibly a lot more (I’m not going to count them, and while I asked three different AI tools to count them, each gave me a different answer).

That’s a lot of people—many representing entities directly hostile to OpenAI—who all need to keep 20 million private conversations secret.

That’s not even getting into the fact that handling 20 million chat logs is a difficult task to do well. I am quite sure that among all the plaintiffs and all the lawyers, even with the very best of intentions, there’s still a decent chance that some of the content could leak (and it could, in theory, leak to some of the media properties who are plaintiffs in the case).

And, as OpenAI properly points out, its users whose data is at risk here have no say in any of this. They likely have no idea that a ton of people may be about to get an intimate look at what they thought were their private ChatGPT chats.

On Wednesday morning, OpenAI asked the judge to reconsider, warning of the very real potential harms:

OpenAI is unaware of any court ordering wholesale production of personal information at this scale. This sets a dangerous precedent: it suggests that anyone who files a lawsuit against an AI company can demand production of tens of millions of conversations without first narrowing for relevance. This is not how discovery works in other cases: courts do not allow plaintiffs suing Google to dig through the private emails of tens of millions of Gmail users irrespective of their relevance. And it is not how discovery should work for generative AI tools either.

The judge had cited a ruling in one of Anthropic’s cases, but hadn’t given OpenAI a chance to explain why the ruling in that case didn’t apply here (in that one, Anthropic had agreed to hand over the logs as part of negotiations with the plaintiffs, and OpenAI gets in a little dig at its competitor, pointing out that it appears Anthropic made no effort to protect the privacy of its users in that case).

There have, as Daphne Keller regularly points out, always been challenges between user privacy and platform transparency. But this goes well beyond that familiar tension. We’re not talking about “platform transparency” in the traditional sense—publishing aggregated statistics or clarifying moderation policies. This is 20 million complete chatlogs, handed over “in whole” to dozens of adversarial parties and their lawyers. The potential damage to the privacy rights of those users could be massive.

And the judge just waves it all away.



Comments on “Judge Orders OpenAI To Give Lawyers 20 Million Private Chats, Thinks ‘Anonymization’ Can Keep Them Private”

26 Comments
Rocky (profile) says:

Re:

Most people are totally oblivious to privacy-related problems because they don’t understand them and the implications, and their response is usually “Why should I care, I don’t have anything to hide.”

It’s the same thinking that’s so common when people engage in risky behaviors because “I never had any problems before.” That is, until their whole life gets fucked up when those risks suddenly become reality.

Anonymous Coward says:

This is extraordinarily bad

(I’m channeling Egon Spengler here.)

Those of us who deal with identified, deidentified, and anonymized data know that it is incredibly difficult to actually make this happen, even with simple data such as tables of alphanumeric values. But at least there are methodologies — painful and tedious methodologies — that allow us to do this, given enough effort and to use statistical analysis to show that we’ve done it.

Of course almost nobody ever bothers with that, all they do is strip out a few fields and declare success. And thus we have the parade of failures mentioned in this article.

But when it comes to the kind of data we’re talking about here, with its syntactic and semantic complexity, I wouldn’t even know where to begin. Heck, I’m not even aware of any research that provides guidance on how to do this with sample data sets, let alone millions.

There’s plenty of blame to go around here: plaintiffs, judge, etc. But it’s also OpenAI’s fault for not having the minimal foresight required to see this coming and realize that keeping so many chat logs was a disastrous choice which would inevitably lead to their disclosure, one way or another.

n00bdragon (profile) says:

Not saying this is a good thing, but OpenAI could have avoided all this trouble by simply not making and retaining those logs in the first place. Finance companies routinely purge old data that they are not legally required to keep, because having it can make them financially liable if a dispute arises. If you’re concerned that people are giving private info to your chat bot (and OpenAI must realize that this is happening), then the only way to truly protect yourself from the law is to not keep anything the law doesn’t require you to keep.

Pink Elephant says:

Re:

This is nonsense. These are not “old logs” that are not needed; these are users’ chats from yesterday. Users expect that previous conversations will be there, so they can search, re-use, and continue past conversations.

This is like saying “Google should purge all old emails the minute after you read them“.

Ethin Probst (profile) says:

OpenAI offered a much more privacy-protective alternative: hand over only a targeted set of logs actually relevant to the case, rather than dumping 20 million records wholesale.

Okay, but was this something OpenAI would have control over? If so, I can kinda understand why the news orgs were not even remotely eager to take them up on that offer. There would be nothing stopping OAI from doing some secret record clean-up to get completely off the hook, and I wouldn’t put it past them to try that given all the other weird things they’ve tried in these cases.

That One Guy (profile) says:

Apply the 'You first' test

Anyone claiming that data like that can be ‘anonymized’ should be told to put up or shut up.

Have them create a full copy of their personal information, from email, doctor’s records to financial data, scrub their name and address from it and then ask them, ‘How willing are you to hand this ‘anonymized’ data to someone who doesn’t know you? It doesn’t have your name or address, so clearly they could never identify you with it, right?’

Arianity (profile) says:

OpenAI offered a much more privacy-protective alternative:

And it just so happens to cover their asses as much as possible. I’m so glad it cares now, and not at any point while they were hoovering up 20 million users’ data (which has exactly the same potential to leak, be hacked, etc.).

Back in August, researchers got their hands on just 1,000 leaked ChatGPT conversations

“leaked”. They were using share links indexable by search engines lol.

there are a ton of lawyers who will have access to these files. The docket list of parties and lawyers is 45 pages long if you try to print it out.

It’s still quite a bit, but those largely seem to be the same few lawyers/firms, just repeated. Never mind that like 2/3 seem to be OpenAI and various subsidiaries. It’s the same Steptoe LLP, Susman Godfrey, Lieff Cabraser, Boies Schiller, Saveri, etc. repeated ad nauseam. Susman Godfrey, for instance, is listed 148 times.

TKnarr (profile) says:

I can’t help but consider this fiasco a good thing, though. It makes utterly clear the problems inherent in the third-party doctrine. Having people’s noses rubbed in it might just motivate enough outrage to get that doctrine revisited and replaced with one that recognizes that people do retain an expectation of privacy in records they give to third parties no matter how inconvenient the government might find that.

crazy_diamond (profile) says:

Redaction

I think that OpenAI should give up the records. Redact them with the same enthusiasm that the DOD and Justice Department do: apply black blocks over all the content (don’t use Acrobat!) except the date, various pronouns, punctuation marks, and any minor words which collectively say nothing. If that’s acceptable from the government, why not from private citizens?

I know, I know, the government will always lead with everything from “national security” to “ongoing criminal investigation” claims. But the argument for PERSONAL security should carry far more weight than it apparently does.

“The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, …” seems like black letter law to me.

P.S. I also know that the 4th Amendment is a dead letter

MrWilson (profile) says:

No, this isn’t a good thing, even if you think it might have a good effect like “waking people up.” They’re going to fall for the next assurance that the next leak won’t happen, supposedly secure data will leak anyway, and that’s the reality of the world we live in. But a judge demanding this leak is still a violation that cannot be cheered, regardless of how irresponsible or naive you think the victims are. What if it’s your spouse or your friend or your boss or your employee who uses your name in their chats, the same way your friends’ and family’s emails contain your responses and, if leaked, will reveal your secrets? It doesn’t matter if it’s a hacker, a judge, or a bad system admin. This is bad. You don’t shoot people to teach them about firearms safety.

Ethin Probst (profile) says:

Re:

Then they either shouldn’t store the chats at all or should store them on the user’s device. This is literally a solved problem by now. There is no way of anonymizing the data (or de-identifying it or anything else) given it’s not structured data, and identifying what is “sensitive” and what isn’t is practically impossible. But if OAI didn’t want to suffer this, maybe they should’ve actually thought about cybersecurity and privacy instead of just creating something and ignoring it until it was impossible to ignore anymore. This is entirely OAI’s fault.

Anonymous Coward says:

Re: Re:

if OAI didn’t want to suffer this

OAI isn’t suffering shit. They aren’t suffering now. They won’t be suffering after this data is released. No suffering will occur for OAI regardless of the outcome here.

The choice is between OAI not suffering, and splashing the details of Bob’s abusive relationship across the internet while OAI doesn’t suffer.

MrWilson (profile) says:

Re: Re:

Then they either shouldn’t store the chats at all

So you’re just saying you don’t understand how the software functions at all?

or should store them on the users device.

This could be a possibility, but really you should just download the offline version of the model and chat locally, though that excludes mobile users and anyone without sufficient processing power and RAM, which brings us back to the question: do you just not understand how the software functions? Do you not understand that the logs are desired by the customers?

This is literally a solved problem by now.

You literally don’t seem to understand what you’re talking about.

There is no way of anonymizing the data (or de-identifying it or anything else) given it’s not structured data and identifying what is “sensitive” and not is practically impossible.

Correct, which is why the judge, who is the source of the issue, shouldn’t have demanded the leak.

But if OAI didn’t want to suffer this, maybe they should’ve actually thought about cybersecurity and privacy instead of just creating something and ignoring it until it was impossible to ignore anymore. This is entirely OAI’s fault.

Would you suggest the same of, say, Google Drive or Microsoft OneDrive? Do you not see any value to the customer in the retention of privately-accessed logs of previous use of the service? Do you think all cloud storage should be deleted immediately upon generation?
