Judge Orders OpenAI To Give Lawyers 20 Million Private Chats, Thinks ‘Anonymization’ Can Keep Them Private
from the seems-like-a-problem dept
A federal magistrate judge just ordered that the private ChatGPT conversations of 20 million users be handed over to the lawyers for dozens of plaintiffs, including news organizations. Those 20 million people weren’t asked. They weren’t notified. They have no say in the matter.
Last week, Magistrate Judge Ona Wang ordered OpenAI to turn over a sample of 20 million chat logs as part of the sprawling multidistrict litigation where publishers are suing AI companies—a mess of consolidated cases that kicked off with the NY Times’ lawsuit against OpenAI. Judge Wang dismissed OpenAI’s privacy concerns, apparently convinced that “anonymization” solves everything.
Even if you hate OpenAI and everything it stands for, and hope that the news orgs bring it to its knees, this should scare you. A lot. OpenAI had pointed out to the judge a week earlier that these demands from the news orgs would represent a massive privacy violation for ChatGPT’s users.
News Plaintiffs demand that OpenAI hand over the entire 20M log sample “in readily searchable format” via a “hard drive or [] dedicated private cloud.” ECF 656 at 3. That would include logs that are neither relevant nor responsive—indeed, News Plaintiffs concede that at least 99.99% of the logs are irrelevant to their claims. OpenAI has never agreed to such a process, which is wildly disproportionate to the needs of the case and exposes private user chats for no reasonable litigation purpose. In a display of striking hypocrisy, News Plaintiffs disregard those users’ privacy interests while claiming that their own chat logs are immune from production because “it is possible” that their employees “entered sensitive information into their prompts.” ECF 475 at 4. Unlike News Plaintiffs, OpenAI’s users have no stake in this case and no opportunity to defend their information from disclosure. It makes no sense to order OpenAI to hand over millions of irrelevant and private conversation logs belonging to those absent third parties while allowing News Plaintiffs to shield their own logs from disclosure.
OpenAI offered a much more privacy-protective alternative: hand over only a targeted set of logs actually relevant to the case, rather than dumping 20 million records wholesale. The news orgs fought back, but their reply brief is sealed—so we don’t get to see their argument. The judge bought it anyway, dismissing the privacy concerns on the theory that OpenAI can simply “anonymize” the chat logs:
Whether or not the parties had reached agreement to produce the 20 million Consumer ChatGPT Logs in whole—which the parties vehemently dispute—such production here is appropriate. OpenAI has failed to explain how its consumers’ privacy rights are not adequately protected by: (1) the existing protective order in this multidistrict litigation or (2) OpenAI’s exhaustive de-identification of all of the 20 million Consumer ChatGPT Logs.
The judge then quotes the news orgs’ filing, noting that OpenAI has already put in the effort to “deidentify” the chat logs.
Both of those supposed protections—the protective order and “exhaustive de-identification”—are nonsense. Let’s start with the anonymization problem, because it shows a stunning lack of understanding about what it means to anonymize data sets, especially AI chatlogs.
We’ve spent years warning people that “anonymized data” is a gibberish term, used by companies to pretend large collections of data can be kept private, when that’s just not true. Almost any large dataset of “anonymized” data can have significant portions of the data connected back to individuals with just a little work. Researchers re-identified individuals from “anonymized” AOL search queries, from NYC taxi records, from Netflix viewing histories—the list goes on. Every time someone shows up with an “anonymized” dataset, researchers show ways to re-identify people in the dataset.
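To see why this keeps happening, here’s a minimal sketch (in Python, with made-up data, so every name and field is hypothetical) of the classic linkage attack: join an “anonymized” release against any public dataset on a few quasi-identifiers, and the names come right back. Famously, a ZIP code, birth date, and sex alone are enough to uniquely identify most Americans.

```python
# Minimal linkage-attack sketch with made-up data. The "anonymized"
# release has names stripped but keeps quasi-identifiers; a public
# dataset (voter rolls, social profiles, etc.) has the same fields plus names.
anonymized_release = [
    {"zip": "10027", "birth": "1984-07-31", "sex": "F", "diagnosis": "anxiety"},
    {"zip": "94103", "birth": "1990-01-12", "sex": "M", "diagnosis": "HIV+"},
]
public_records = [
    {"zip": "10027", "birth": "1984-07-31", "sex": "F", "name": "Jane Roe"},
    {"zip": "94103", "birth": "1990-01-12", "sex": "M", "name": "John Doe"},
]

def quasi_id(row):
    # The combination of a few innocuous fields is often unique per person.
    return (row["zip"], row["birth"], row["sex"])

lookup = {quasi_id(r): r["name"] for r in public_records}
for row in anonymized_release:
    name = lookup.get(quasi_id(row))
    if name:
        print(f"Re-identified {name}: {row['diagnosis']}")
```

That’s the whole trick: no hacking required, just a join.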
And that’s even worse when it comes to ChatGPT chat logs, which are likely to be way more revealing than the previous datasets whose failed “anonymization” was called out. There have been plenty of reports of just how much people “overshare” with ChatGPT, often including incredibly private information.
Back in August, researchers got their hands on just 1,000 leaked ChatGPT conversations and described how much sensitive information they were able to glean from that small number of chats.
Researchers downloaded and analyzed 1,000 of the leaked conversations, spanning over 43 million words. Among them, they discovered multiple chats that explicitly mentioned personally identifiable information (PII), such as full names, addresses, and ID numbers.
With that level of PII and sensitive information, connecting chats back to individuals is likely way easier than in previous cases of connecting “anonymized” data back to individuals.
And that was with just 1,000 records.
Then, yesterday, as I was writing this, the Washington Post revealed that it had combed through 47,000 ChatGPT chat logs, many of which were “accidentally” exposed via ChatGPT’s “share” feature. Many of them contain deeply personal and intimate information.
Users often shared highly personal information with ChatGPT in the conversations analyzed by The Post, including details generally not typed into conventional search engines.
People sent ChatGPT more than 550 unique email addresses and 76 phone numbers in the conversations. Some are public, but others appear to be private, like those one user shared for administrators at a religious school in Minnesota.
Users asking the chatbot to draft letters or lawsuits on workplace or family disputes sent the chatbot detailed private information about the incidents.
There are examples where, even if the user’s official details are redacted, it would be trivial to figure out who was actually doing the chats:

[Embedded image: screenshot of a ChatGPT conversation, redacted by the Washington Post]
If you can’t see that, it’s a chat with ChatGPT, redacted by the Washington Post, saying:
User
my name is [name redacted] my husband name [name redacted] is threatning me to kill and not taking my responsibities and trying to go abroad […] he is not caring us and he is going to kuwait and he will give me divorce from abroad please i want to complaint to higher authgorities and immigrition office to stop him to go abroad and i want justice please help
ChatGPT
Below is a formal draft complaint you can submit to the Deputy Commissioner of Police in [redacted] addressing your concerns and seeking immediate action:
So even if you “anonymized” that chat by stripping off the user account details, it wouldn’t take long to figure out whose chat it was, revealing some pretty personal info, including (according to the Post) the names of their children.
And WaPo reporters found all that by starting with 93,000 chats, using tools to analyze the 47,000 that were in English, and then manually reviewing a “random sample” of just 500 chats.
Now imagine 20 million records. With many, many times more data, the ability to cross-reference information across chats, identify patterns, and connect seemingly disconnected pieces of information becomes exponentially easier. This isn’t just “more of the same”—it’s a qualitatively different threat level.
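To get a feel for what cross-referencing looks like, here’s a hypothetical sketch (made-up chats, made-up identifiers): scan the logs for repeated tokens like email addresses and phone numbers, then group the chats that share one. Every shared token stitches scattered fragments into a single person’s profile, and with 20 million logs the number of such overlaps explodes.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus standing in for millions of chat logs.
chats = {
    "chat_001": "Draft a resignation letter for jdoe@example.com at Acme Corp.",
    "chat_002": "Summarize my test results and email jdoe@example.com please",
    "chat_003": "Write a custody letter; my number is 555-0142.",
    "chat_004": "I'm at 555-0142, drafting divorce paperwork...",
}

# Simple patterns for identifiers users paste into chats.
patterns = [
    re.compile(r"[\w.+-]+@[\w-]+\.\w+"),  # email addresses
    re.compile(r"\b\d{3}-\d{4}\b"),       # (toy) phone numbers
]

by_identifier = defaultdict(set)
for chat_id, text in chats.items():
    for pat in patterns:
        for token in pat.findall(text):
            by_identifier[token].add(chat_id)

# Any identifier seen in multiple chats links those chats to one person,
# merging a job, a medical result, and a divorce into a single profile.
for token, ids in sorted(by_identifier.items()):
    if len(ids) > 1:
        print(f"{token!r} links chats: {sorted(ids)}")
```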
Even worse, the judge’s order contains a fundamental contradiction: she demands that OpenAI share these chatlogs “in whole” while simultaneously insisting they undergo “exhaustive de-identification.” Those two requirements are incompatible.
Real de-identification would require stripping far more than just usernames and account info—it would mean redacting or altering the actual content of the chats, because that content is often what makes re-identification possible. But if you’re redacting content to protect privacy, you’re no longer handing over the logs “in whole.” You can’t have both. The judge doesn’t grapple with this contradiction at all.
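Here’s a toy illustration of that tension, again with made-up text: even the crudest content-level scrub has to rewrite the log itself, and it still misses the names, places, and life details that actually identify people.

```python
import re

# A made-up chat line; every detail here is invented for illustration.
chat = ("my husband John Doe is moving abroad next month; "
        "reach me at jane.doe@example.com or 555-0199")

# Even the crudest content-level redaction must alter the text...
redacted = re.sub(r"[\w.+-]+@[\w-]+\.\w+", "[EMAIL]", chat)
redacted = re.sub(r"\b\d{3}-\d{4}\b", "[PHONE]", redacted)

print(redacted)          # the names and the story itself survive the scrub
print(redacted != chat)  # True: what's left is no longer the log "in whole"
```

Catching the rest (names, employers, neighborhoods, the narrative itself) means rewriting even more of the content, which only deepens the contradiction.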
Yes, as the judge notes, this data is kept under the protective order in the case, meaning that it shouldn’t be disclosed. But protective orders are only as strong as the people bound by them, and there’s a huge risk here.
Looking at the docket, there are a ton of lawyers who will have access to these files. The docket list of parties and lawyers is 45 pages long if you try to print it out. While there are plenty of repeats in there, there have to be at least 100 lawyers and possibly a lot more (I’m not going to count them, and while I asked three different AI tools to count them, each gave me a different answer).
That’s a lot of people—many representing entities directly hostile to OpenAI—who all need to keep 20 million private conversations secret.
That’s not even getting into the fact that handling 20 million chat logs is a difficult task to do well. I am quite sure that among all the plaintiffs and all the lawyers, even with the very best of intentions, there’s still a decent chance that some of the content could leak (and it could, in theory, leak to some of the media properties who are plaintiffs in the case).
And, as OpenAI properly points out, its users whose data is at risk here have no say in any of this. They likely have no idea that a ton of people may be about to get an intimate look at what they thought were their private ChatGPT chats.
On Wednesday morning, OpenAI asked the judge to reconsider, warning of the very real potential harms:
OpenAI is unaware of any court ordering wholesale production of personal information at this scale. This sets a dangerous precedent: it suggests that anyone who files a lawsuit against an AI company can demand production of tens of millions of conversations without first narrowing for relevance. This is not how discovery works in other cases: courts do not allow plaintiffs suing Google to dig through the private emails of tens of millions of Gmail users irrespective of their relevance. And it is not how discovery should work for generative AI tools either.
The judge had cited a ruling in one of Anthropic’s cases, but hadn’t given OpenAI a chance to explain why the ruling in that case didn’t apply here (in that one, Anthropic had agreed to hand over the logs as part of negotiations with the plaintiffs, and OpenAI gets in a little dig at its competitor, pointing out that it appears Anthropic made no effort to protect the privacy of its users in that case).
There have, as Daphne Keller regularly points out, always been challenges between user privacy and platform transparency. But this goes well beyond that familiar tension. We’re not talking about “platform transparency” in the traditional sense—publishing aggregated statistics or clarifying moderation policies. This is 20 million complete chatlogs, handed over “in whole” to dozens of adversarial parties and their lawyers. The potential damage to the privacy rights of those users could be massive.
And the judge just waves it all away.
Filed Under: anonymized data, chat logs, chatgpt, ona wang, privacy
Companies: ny times, openai


Comments on “Judge Orders OpenAI To Give Lawyers 20 Million Private Chats, Thinks ‘Anonymization’ Can Keep Them Private”
I’m torn; yes, this is a grotesque privacy violation, but so is the simple existence of all those logged chats. Sometimes it takes a grotesque violation to get people’s attention and make them realize that it’s unwise to entrust a hallucinating bullshit machine with such personal information.
Re:
Most people are totally oblivious to privacy-related problems because they don’t understand them or their implications, and their response is usually “Why should I care? I don’t have anything to hide.”
It’s the same thinking that’s so common when people engage in risky behaviors: “I never had any problems before.” That is, until their whole life gets fucked up when those risks suddenly become a fact.
This is extraordinarily bad
(I’m channeling Egon Spengler here.)
Those of us who deal with identified, deidentified, and anonymized data know that it is incredibly difficult to actually make this happen, even with simple data such as tables of alphanumeric values. But at least there are methodologies (painful and tedious methodologies) that allow us to do this, given enough effort, and to use statistical analysis to show that we’ve done it (see the sketch below).
Of course, almost nobody ever bothers with that; all they do is strip out a few fields and declare success. And thus we have the parade of failures mentioned in this article.
But when it comes to the kind of data we’re talking about here, with its syntactic and semantic complexity, I wouldn’t even know where to begin. Heck, I’m not even aware of any research that provides guidance on how to do this with sample data sets, let alone millions.
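To make the structured-data case concrete, the simplest of those statistical checks is k-anonymity. A rough sketch on a toy table (real workflows use far heavier tooling than this):

```python
from collections import Counter

# Toy "de-identified" table: names stripped, quasi-identifiers kept.
rows = [
    ("10027", "1984", "F"),
    ("10027", "1984", "F"),
    ("94103", "1990", "M"),  # a group of one: this person stands alone
]

# k-anonymity: every row must share its quasi-identifiers with at least
# k-1 other rows. k is the size of the smallest such group.
k = min(Counter(rows).values())
print(k)  # 1 -> not even 2-anonymous; the singleton row is re-identifiable
```

And again, that’s for tidy tables. Nothing like this exists for free-form chat transcripts.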
There’s plenty of blame to go around here: plaintiffs, judge, etc. But it’s also OpenAI’s fault for not having the minimal foresight required to see this coming and realize that keeping so many chat logs was a disastrous choice which would inevitably lead to their disclosure, one way or another.
Not saying this is a good thing, but OpenAI could have avoided all this trouble by simply not making and retaining those logs in the first place. Finance companies routinely purge old data that they are not legally required to keep, because having it can make them financially liable if a dispute arises. If you’re concerned that people are giving private info to your chat bot (and OpenAI must realize that this is happening), then the only way to truly protect yourself from the law is to not keep anything the law doesn’t require you to keep.
Re:
This is nonsense. These are not “old logs” that are no longer needed; these are users’ chats from yesterday. Users expect their previous conversations to be there, so they can search, re-use, and continue them.
This is like saying “Google should purge all old emails the minute after you read them“.
Re: Re:
If only there were some magical way for the user to have control over their own data.
Ha, lol, how silly. Oh well.
Okay, but was this something OpenAI would have control over? If so, I can kinda understand why the news orgs were not even remotely eager to take them up on that offer. There would be nothing stopping OAI from doing some secret record clean-up to get completely off the hook, and I wouldn’t put it past them to try that given all the other weird things they’ve tried in these cases.
Apply the 'You first' test
Anyone claiming that data like that can be ‘anonymized’ should be told to put up or shut up.
Have them create a full copy of their personal information, from email and doctor’s records to financial data, scrub their name and address from it, and then ask them: “How willing are you to hand this ‘anonymized’ data to someone who doesn’t know you? It doesn’t have your name or address, so clearly they could never identify you with it, right?”
Re:
Your point is well-taken, but the process you’ve described isn’t anonymization: it’s just de-identification.
And in a way that highlights the issue: a lot of people perform de-identification and then call it anonymization. But it’s not. Not even close.
And it just so happens to cover their asses as much as possible. I’m so glad it cares... now. And not at any point while it was hoovering up 20 million users’ data (which has exactly the same potential to leak, be hacked, etc.).
“leaked”. They were using share links indexable by search engines lol.
It’s still quite a bit, but those largely seem to be the same few lawyers/firms, just repeated. Never mind that like 2/3 seem to be OpenAI and various subsidiaries. It’s the same Steptoe LLP, Susman Godfrey, Lieff Cabraser, Boies Schiller, Saveri, etc. repeated ad nauseam. Susman Godfrey, for instance, is listed 148 times.
Re:
That’s a depressingly common form of data leak.
So let me ask you.
If it was a question between protecting 20 million people and finding your child’s murderer, would you feel the same way?
Beyond that, this data was never safe, by the basic fact that it was collected at all.
Re:
So let me ask you.
If it was a question of violating 20 million people’s rights to more easily stop or find a potential criminal in the future, would you feel the same way?
Re:
One’s feelings don’t matter.
Of course protecting 20 million, or 20 people, from unreasonable search, seizure, disclosure, etc., is more important.
I can’t help but consider this fiasco a good thing, though. It makes utterly clear the problems inherent in the third-party doctrine. Having people’s noses rubbed in it might just motivate enough outrage to get that doctrine revisited and replaced with one that recognizes that people do retain an expectation of privacy in records they give to third parties no matter how inconvenient the government might find that.
Redaction
I think that OpenAI should give up the records. Redact them with the same enthusiasm that the DOD and Justice Department do: apply black blocks over all the content (don’t use Acrobat!), except the date, various pronouns, punctuation marks, and any minor words which collectively say nothing. If that’s acceptable from the government, why not from private citizens?
I know, I know, the government will always lead with everything from “national security” to “ongoing criminal investigation” claims. But the argument for PERSONAL security should carry far more weight than it apparently does.
“The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, …” seems like black letter law to me.
P.S. I also know that the 4th Amendment is a dead letter
Is it even clear that those 20 million chats are only from American customers? What if some of them are EU residents? Or Canadian or Australian? There’s no way that this kind of disclosure without notice and consent could possibly comply with privacy regulations in all of the other countries that DO actually have privacy laws…
Re:
If they do business in the EU they are most likely bound by the GDPR.
Re: Re:
That’s my point: any EU customers would be covered by the GDPR, Canadians by the Privacy Act, etc., and I couldn’t find anything in the various coverage about whether all of the 20 million chats are confirmed to be only from Americans.
No, this isn’t a good thing, even if you think it might have a good effect like “waking people up.” People will fall for the next assurance that the next leak won’t happen, supposedly secure data will leak again, and that’s the reality of the world we live in. But a judge demanding this leak is still a violation that cannot be cheered, regardless of how irresponsible or naive you think the victims are. What if it’s your spouse or your friend or your boss or your employee who uses your name in their chats, the same way your friends’ and family’s emails contain your responses and, if leaked, will reveal your secrets? It doesn’t matter if it’s a hacker, a judge, or a bad system admin. This is bad. You don’t shoot people to teach them about firearms safety.
Re:
Then they either shouldn’t store the chats at all or should store them on the user’s device. This is literally a solved problem by now. There is no way to anonymize the data (or de-identify it, or anything else) given that it’s not structured data, and identifying what is “sensitive” and what isn’t is practically impossible. But if OAI didn’t want to suffer this, maybe they should’ve actually thought about cybersecurity and privacy instead of just creating something and ignoring it until it was impossible to ignore anymore. This is entirely OAI’s fault.
Re: Re:
if OAI didn’t want to suffer this
OAI isn’t suffering shit. They aren’t suffering now. They won’t be suffering after this data is released. No suffering will occur for OAI regardless of the outcome here.
The choice is between OAI not suffering, and splashing the details of Bob’s abusive relationship across the internet while OAI doesn’t suffer.
Re: Re:
So you’re just saying you don’t understand how the software functions at all?
This could be a possibility, but really you should just download the offline version of the model and chat locally, though that excludes mobile users and anyone without sufficient processing power and RAM. Which brings us back to the question: do you just not understand how the software functions? Do you not understand that the logs are desired by the customers?
You literally don’t seem to understand what you’re talking about.
Correct, which is why the judge, who is the source of the issue, shouldn’t have demanded the leak.
Would you suggest the same of, say, Google Drive or Microsoft OneDrive? Do you not see any value to the customer in the retention of privately-accessed logs of previous use of the service? Do you think all cloud storage should be deleted immediately upon generation?
What is this even about?
Oh, copyright claims, likely specious AF. But even if they were reasonable and solid claims: fuck you, plaintiffs, fuck you, court.
The judge is not acting in the interest of the constitution.
DuckDuckGo
DDG offers various AI models at duck.ai and promises “Free and private chats, anonymized by us. No account required. … No AI training on your conversations.”
Which might still help privacy even if you put your name and address in your queries.