Can ChatGPT Violate Your Privacy Rights If It Doesn’t Store Your Data?

from the this-makes-no-sense dept

If you were to ask someone to state the birthday of someone else, and the person asked just made up a date, which was not the actual birthday, would you argue that the individual’s privacy had been violated? Would you argue that there should be a legal right to demand that the person explain how they came up with the made-up date and to permanently “store” the proper birth date in their mind?

Or would you simply laugh it off as utter nonsense?

I respect the folks at noyb, the European privacy activists who keep filing privacy complaints that often have significant consequences. noyb and its founder, Max Schrems, have pretty much single-handedly continued to rip up US/EU privacy agreements by highlighting that NSA surveillance simply cannot comply with EU data privacy protections.

That said, noyb often seems to take things a bit too far, and I think its latest complaint against OpenAI is one of those cases.

From noyb's announcement:

In the EU, the GDPR requires that information about individuals is accurate and that they have full access to the information stored, as well as information about the source. Surprisingly, however, OpenAI openly admits that it is unable to correct incorrect information on ChatGPT. Furthermore, the company cannot say where the data comes from or what data ChatGPT stores about individual people. The company is well aware of this problem, but doesn’t seem to care. Instead, OpenAI simply argues that “factual accuracy in large language models remains an area of active research”. Therefore, noyb today filed a complaint against OpenAI with the Austrian DPA.

I have to admit, sometimes I kinda wonder if noyb is really a kind of tech policy performance art, trying to make a mockery of the GDPR. Because that’s about the only way this complaint makes sense.

The assumptions underlying the complaint are that ChatGPT is something that it is not, that it does something that it does not do, and that this somehow implicates rights that are not implicated at all.

Again, generative AI chat tools like ChatGPT make up content based on what they’ve learned over time. They are not storing and collecting such data. They are not retrieving data they have stored. Many people seem to think that ChatGPT is somehow the front end for a database, or the equivalent of a search engine.

It is not.

It is a digital guessing machine, trained on tons of written works. So, when you prompt it, it is probabilistically guessing at what it can say to respond in a reasonable, understandable manner. It’s predictive text on steroids. But it’s not grabbing data from a database. This is why it does silly things like make up legal cases that don’t exist. It’s not because it has bad data in its database. It’s because it’s making stuff up as it goes based on what “sounds” right.
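
To make the “predictive text on steroids” point concrete, here is a minimal, purely illustrative sketch. The candidate words and the “plausibility” scores below are invented for the example (no real model is this small, and these are not anyone’s actual weights); the point is only the shape of the process: score candidate next tokens, turn the scores into probabilities, and sample one. Nothing is looked up in a table of facts.

```python
import math
import random

# Toy illustration only: invented vocabulary and scores, not real model internals.
# Given a prompt like "This person was born in", the model assigns a score
# (logit) to every candidate next token and samples from the resulting
# probability distribution. There is no lookup of a stored birthday anywhere.
candidate_tokens = ["June", "July", "January", "the", "a"]
logits = [2.1, 1.7, 1.5, 0.4, 0.1]  # "plausibility" scores, made up for this sketch

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_token = random.choices(candidate_tokens, weights=probs, k=1)[0]
print(next_token)  # "June" is the most likely single pick, but any candidate can come out
```

Run it a few times and the continuation changes, which is the whole point: the output is sampled from a distribution over plausible-sounding continuations, not retrieved from a record that could later be “corrected.”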

And, yes, there are some cases where it seems closer to storing data, in that the nature of the training and the probabilistic engine is that it effectively has a very lossy compression algorithm that allows it to sometimes recreate data that closely approximates the original. But that’s still not the same thing as storing data in a database, and in the example used by noyb (a random person’s birthday) that’s simply not the kind of data that is at issue here.

Yet, noyb’s complaint is that ChatGPT can’t tell you what data it has on people (because it doesn’t “have data” on people) and that it can’t correct mistakes (because there’s nothing to “correct” since it’s not pulling what it writes from a database that can be corrected).

The complaint is kind of like arguing that if you ask a friend of yours about someone else, and they repeat some false information, that friend is required under the GDPR to explain why they said what they said and to “correct” what is wrong.

But noyb insists this is true for ChatGPT.

Simply making up data about individuals is not an option. This is very much a structural problem. According to a recent New York Times report, “chatbots invent information at least 3 percent of the time – and as high as 27 percent”. To illustrate this issue, we can take a look at the complainant (a public figure) in our case against OpenAI. When asked about his birthday, ChatGPT repeatedly provided incorrect information instead of telling users that it doesn’t have the necessary data.

If this is actually a violation of the GDPR, noyb’s real complaint is with the GDPR, not with ChatGPT. Again, this only makes sense for an app that is storing and retrieving data.

But that’s not what’s happening. ChatGPT is probabilistically guessing at what to respond with.

noyb's announcement continues:

No GDPR rights for individuals captured by ChatGPT? Despite the fact that the complainant’s date of birth provided by ChatGPT is incorrect, OpenAI refused his request to rectify or erase the data, arguing that it wasn’t possible to correct data.

There is no data to correct. This is just functionally wrong. It’s like filing a complaint against an orange for not being an apple. It’s just a fundamentally different kind of service.

Now, there are some attempts at generative AI tools that do store data. The hot topic in the generative AI world these days is RAG, “retrieval-augmented generation,” in which an AI is also “retrieving” data from some sort of database. noyb’s complaint would make more sense if it had found a RAG system returning false personal information; in that scenario, the complaint would actually fit the technology.
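
For contrast, here is a rough, hypothetical sketch of the retrieval-augmented pattern. The toy in-memory store and the `search_documents` and `generate` functions are placeholders invented for this illustration, not any vendor's actual API; the point is only that a RAG system consults an explicit store of saved text before generating, which is exactly the component a plain generative model lacks (and exactly where a "correct the record" request could, at least in principle, be pointed).

```python
# Hypothetical sketch of retrieval-augmented generation (RAG).
# The "store" here is a toy in-memory dict standing in for a real document
# database, and generate() is a stub standing in for a call to an LLM.

TOY_STORE = {
    "example corp founding": "Example Corp was founded in 1999.",
    "example corp headquarters": "Example Corp is based in Springfield.",
}

def search_documents(query: str) -> list[str]:
    # Toy retrieval: return stored snippets whose key shares a word with the query.
    words = set(query.lower().split())
    return [text for key, text in TOY_STORE.items() if words & set(key.split())]

def generate(prompt: str) -> str:
    # Stub: a real system would send this prompt to a language model.
    return f"[model answer, grounded in: {prompt!r}]"

def rag_answer(question: str) -> str:
    context = "\n".join(search_documents(question))  # data that actually is stored
    prompt = f"Answer using only this context:\n{context}\nQuestion: {question}"
    return generate(prompt)

print(rag_answer("Where is example corp based?"))
```

If a store like that returned a wrong birthday, there would be a concrete record to rectify or erase. That is the scenario noyb's complaint seems to assume, and it is not how ChatGPT works.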

But when we’re talking about a regular old generative AI model without retrieval capabilities, it makes no sense at all.

If noyb honestly thinks that what ChatGPT is doing is violating the GDPR, then there are only two possibilities: (1) noyb has no idea what it’s talking about here or (2) the GDPR is even more silly than we’ve argued in the past, and all noyb is doing is trolling to make that clear by filing a laughably silly complaint that exposes how poorly fit the GDPR is to the technology in our lives today.

Companies: noyb, openai


Comments on “Can ChatGPT Violate Your Privacy Rights If It Doesn’t Store Your Data?”

41 Comments
JSpitzen (profile) says:

Duty to correct misinformation

I may be old enough to know a trivia fact unknown to many TechDirt readers. Once upon a time, when computers were first able to simulate fax machines, some of the European countries required vendors to register their systems before they could be allowed to send computer-generated faxes to the corresponding country. One of them–I think it was France–had a requirement that if the computer dialed what it thought was a fax number but a human being answered the phone, that number had to be stored so that the system could remember never to call it again.

During that same era, another country–might have been Switzerland–required that the sending machine “listen” for a special tone that indicated an emergency and then immediately disconnect. That would allow the authorities to send a notification warning the recipients to head for their bomb shelters.

Anonymous Coward says:

Here's a hypothetical

What happens if ChatGPT is asked a question about a private bit of personal data and happens to answer correctly? (Depending on the question and the data this may be improbable or likely.)

Obviously it didn’t pull the answer from a database: it constructed it from its internal language model. So the answer doesn’t actually exist, anywhere, per se, inside ChatGPT.

How does this play with the GDPR?

MrWilson (profile) says:

Re: Re: Re:

Exactly. Literally everything an LLM that is not pulling from a database or search engine says is “made up” in the sense that it is just predicting the next words in a sentence rather than actually trying to be factually accurate. It’s like a parrot that can just repeat a lot more words and phrases and recognize more complicated patterns of how those words and phrases might be combined and ordered.


Benjamin Jay Barber says:

Mike Masnick Malding Again

I don’t disagree with the policy positions of Mike on this issue, but he again demonstrates his ignorance of both the law and how these systems work.

  1. There is a privacy tort called “false light”, which is what is being violated when the LLM hallucinates facts.
  2. The AI models DO store data, and they are in the form of a database, that is in fact what the “weights” of the neural network are.
  3. A neural network does not NEED to compress data, and thereby perform hallucinations, these models can be “over-parametrized”, its just very much more expensive to increase the numbers of parameters, but the choice of compression level and the scope of the training data is at the discretion of the company.
  4. A neural network can to some degree know when it might be hallucinating, when performing the “softmax” operation to predict the next tokens, analyzing the “perplexity” of the token candidates.
MrWilson (profile) says:

Re:

There is a privacy tort called “false light”, which is what is being violated when the LLM hallucinates facts.

False light is a privacy tort in the US. We’re talking about Europe. But also, false light typically requires the defendant to publish the information widely rather than just in a private chat, it requires the misinformation to be highly offensive to a reasonable person, and the defendant must be at fault. These requirements aren’t met unless you can prove the company intentionally programmed an LLM to specifically identify and defame individuals and did so to a large audience rather than just one person in a chat. And no reasonable person, understanding that a non-human LLM literally makes up everything it says by its very nature (barring a web search or a RAG), would be offended by it. So you’re wrong on top of being wrong on top of being wrong.

The irony is that your hallucinated “facts” are more offensive than ChatGPT’s.

bhull242 (profile) says:

Re:

There is a privacy tort called “false light”, which is what is being violated when the LLM hallucinates facts.

That tort does not exist in Europe, where this lawsuit was filed, and so it is irrelevant. The relevant law here is the GDPR. That’s what the complaint references.

The AI models DO store data, and they are in the form of a database, that is in fact what the “weights” of the neural network are.

It doesn’t store data about people or facts. It stores data about language. Those are not the same thing, and only the former can support this complaint.

A neural network does not NEED to compress data

It doesn’t actually compress data at all; that was just figurative language.

and thereby perform hallucinations, these models can be “over-parametrized”, its just very much more expensive to increase the numbers of parameters, but the choice of compression level and the scope of the training data is at the discretion of the company.

Yes, and the article speaks on such models. ChatGPT just isn’t one of them. The existence of others that do so is irrelevant to whether the law in question actually applies to what ChatGPT does.

A neural network can to some degree know when it might be hallucinating, when performing the “softmax” operation to predict the next tokens, analyzing the “perplexity” of the token candidates.

But it can never be eliminated altogether, nor is there any legal duty for AI makers to do so. This, too, is missing the point.

Anonymous Coward says:

Re: Re:

It doesn’t store data about people or facts. It stores data about language. Those are not the same thing, and only the former can support this complaint.

Correct. The GDPR protects personal information (from my reading of it); it doesn’t protect words as such. If it did, everything would have to shut down for violating the GDPR, including schools, where words are taught.

Mamba (profile) says:

ignorance of both the law and how these systems work.

Your assessment of his qualifications carries absolutely no authority considering you went to jail for six months based on your misunderstanding of the law. In fact, your criticism stands as a glowing endorsement.

I don’t even need to get into your misunderstanding of torts (not what’s under discussion), weights (this is not personal data, as it’s an aggregation of large data sets), or compression (as an analogy).

Anonymous Coward says:

The article critiques a recent complaint by the European privacy activists, noyb, against OpenAI regarding ChatGPT’s compliance with the GDPR. It argues that noyb’s complaint misunderstands the nature of generative AI tools like ChatGPT, which generate content without storing or retrieving data. The analogy of asking a friend about someone’s birthday, where the friend provides incorrect information, is used to illustrate that ChatGPT operates similarly—it generates responses based on learned data rather than retrieving stored information. The article suggests that noyb’s complaint is misplaced and questions whether it’s a genuine concern or a tactic to highlight flaws in the GDPR’s applicability to modern technology. It concludes by emphasizing the fundamental difference between generative AI models and those with retrieval capabilities, suggesting that if noyb believes ChatGPT violates GDPR, it either misunderstands the technology or exposes the GDPR’s inadequacy in regulating it.

Hey noyb: let’s team up to navigate the intricacies of privacy in our tech-driven world and champion humanity’s best interests together. We’re all in this digital adventure, let’s make it a collaborative one! 🌟

Arianity says:

If you were to ask someone to state the birthday of someone else, and the person asked just made up a date, which was not the actual birthday, would you argue that the individual’s privacy had been violated?

Would that violate the GDPR? Just leaving the AI part aside, would making up “data” fall under it? I know it has some provisions for incorrect information, so I wonder if that would be covered.

And, yes, there are some cases where it seems closer to storing data, in that the nature of the training and the probabilistic engine is that it effectively has a very lossy compression algorithm that allows it to sometimes recreate data that closely approximates the original, but that’s still not the same thing as storing data in a database,

If you look at the definitions in the GDPR, it doesn’t actually say anything about e.g. a database:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;

Further, it says personal data should be:

accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay (‘accuracy’);

I think you could make the argument that simply making something up isn’t “accurate”? It does mention phrases like a “filing system” in some places, though.

That said, going after training data seems like a way easier target, rather than outputs.

by filing a laughably silly complaint that exposes how poorly fit the GDPR is to the technology in our lives today.

For what it’s worth, this is probably a bit moot in the long term, as the EU has an AI-specific bill coming up soon: https://artificialintelligenceact.eu/

Anonymous Coward says:

Re:

If you were to ask someone to state the birthday of someone else, and the person asked just made up a date, which was not the actual birthday, would you argue that the individual’s privacy had been violated?

Would that violate GPDR? Just leaving the AI part aside, would making up “data” fall under it?

Since I’m not a ‘data processor’ under the law, then no.

Anonymous Coward says:

I think the relevant phrase in GDPR is “structured filing system”? So a set of random sticky notes is not covered by GDPR, but the same information alphabetically indexed in a folder may be covered depending on the type of information.

Personally, I’d say an LLM is closer to an unstructured than a structured filing system.

Anonymous Coward says:

Re:

GDPR states ‘filing system’ means any structured set of personal data which are accessible according to specific criteria, whether centralised, decentralised or dispersed on a functional or geographical basis.

LLMs pull their answers from some data they store somewhere after having ingested the huge training dataset. They cannot work otherwise.
This is why LLMs cannot yet be used on a smartphone for offline use; they require too much data storage (but still much less than the training dataset).

Whether that stored data looks structured to the human eye is not the question. That data is necessarily structured in a way that allows the LLM to pull out answers that make some sense.

I understand that the material scope of the GDPR (Article 2) is fulfilled.


Anonymous Coward says:

Techdirt spectacularly miss the point

The product is sold as a search engine. That’s why this case makes sense.
If this stops llms being touted as search engine replacements then it’s a job well done.
This is a computer where you have to check the answers. There is no value in that beyond “oooh, doesn’t it sound like a human?”
Yes, it sounds like a human who hasn’t got a fucking clue….

MrWilson (profile) says:

Re:

The product is sold as a search engine

No. It literally isn’t. It didn’t even have search integrated until about six months ago.

That’s why this case makes sense.

So since it isn’t sold as a search engine, you’re admitting the case doesn’t make sense. Agreed.

This is a computer where you have to check the answers.

ChatGPT isn’t a computer.

There is no value in that beyond “oooh, doesn’t it sound like a human?”

It’s a tool. It has uses you apparently haven’t conceived of. That doesn’t make it valueless.

Yes, it sounds like a human who hasn’t got a fucking clue….

Are you saying your comment was written by ChatGPT because “it sounds like a human who hasn’t got a fucking clue”?

bhull242 (profile) says:

Re:

The product is sold as a search engine. That’s why this case makes sense.

Absolutely no one—least of all OpenAI—sells ChatGPT as a search engine. On the contrary, it includes disclaimers that explicitly say that it is not a search engine.

Whether or not some other company advertises some AI as a search engine is irrelevant because this isn’t a complaint filed against that other company or about that AI; it’s against OpenAI, who has never once claimed that ChatGPT was a search engine.

If this stops llms being touted as search engine replacements then it’s a job well done.

Since OpenAI does not tout ChatGPT as a search engine replacement and never has, the complaint was filed against the wrong party, so it makes no sense and will not accomplish that goal.

This is a computer where you have to check the answers.

You should always be doing that, anyways. Look at a calculator. It will sometimes give garbage answers due to rounding errors or due to bad inputs.

There is no value in that beyond “oooh, doesn’t it sound like a human?”

Value is in the eye of the beholder. Just because you don’t find any value in it doesn’t mean it doesn’t have value to someone.

At any rate, who cares if it has value beyond that? That doesn’t make the complaint any more valid since the law doesn’t require the product to have value.

Anonymous Coward says:

Re: Re:

“Look at a calculator. It will sometimes give garbage answers due to rounding errors or due to bad inputs.”

Bad inputs cause wrong answers?

I would assume the answer was right, the inputs were wrong. Not the calculator’s fault.

8X12 = 96

If you meant to input 122 rather than 12, that doesn’t make it a garbage answer, just a garbage analogy.

Anonymous Coward says:

An answer to a query is… an answer.
Whether the LLM is conscious that the string of text it outputs is the name of a person or not is irrelevant to the GDPR.
If the string of text can help to identify a person, it is personal data for GDPR.
I would be interested to know how an LLM can work without storing data somewhere…

bhull242 (profile) says:

Re:

If the string of text can help to identify a person, it is personal data for GDPR.

But the GDPR doesn’t care about data that isn’t stored or published. ChatGPT does neither for strings of text that can identify a person.

I would be interested to know how an LLM can work without storing data somewhere…

It doesn’t store factual data. It stores probabilistic data about language patterns. Basically, none of what it outputs is stored anywhere in the system. It is created anew based on inputs from the user, and it doesn’t (usually) retain that output or those inputs for future retrieval. Indeed, you can get ChatGPT to give different answers to the same prompt, demonstrating that it hasn’t stored the answers at all.

Anonymous Coward says:

Re: Re:

2 important points:

A: the storage of some data is not a GDPR requirement. It is just one of the processes that fall under it. Any operation on personal data is a processing. It could be collection, transmission, combination, structuring, use and many others as set, without limits, in GDPR article 4.2.

B: personal data is any kind of information that can help identify a person, directly or not. Whether the data is human understandable or not, and whether it is made up of factual/pseudonymized/statistical/probabilistic/gibberish information or not, is irrelevant to the GDPR as long as it can help – using the LLM – identify a person.
An IP address, an ID number, a name, a physical address… are all strings of text that could be used to help identify a person. It is not difficult to have an LLM output strings of text that are personal data. Ask ChatGPT a question about Donald Trump and most of the output string of text will be personal data related to him (it can help identify him). A simple list of Donald Trump’s achievements, even without his name, is also his personal data if someone can identify him from it.

In the present case, the public figure’s name has been collected by the company and has been used to train (combined/structured/…) an algorithm. The algorithm itself (containing the personal data in a statistical/probabilistic format we cannot understand) is stored. Upon receiving a query from a browser/app, the algorithm then uses and organises the data to communicate its answer in HTML form to the browser/app.
All these operations on personal data (whether human readable or not) fall under GDPR.

Openai themselves state that they do process personal data of that public figure when they say that they can block any information about the data subject.

What is manifestly unfounded or excessive in exercising one’s right to erasure, a right enshrined in GDPR? Again, whether that data is human readable or not is irrelevant to GDPR.
Openai should have taken the GDPR’s right to erasure into account when designing ChatGPT.

Disclaimer: ChatGPT was used for some translation verification.

Anonymous Coward says:

Re: Re: Re:

Following up on my previous post.
Regarding the date of birth of that public figure, it seems that it was not present in the training dataset. So it is not present within the algorithm either.

Still, a wrong date of birth related to the public figure (erroneous personal data) was communicated by Openai to the user’s browser in HTML form and stored as a conversation within the person’s account.
It seems no one knows how the algorithm makes this up, and it could indeed be difficult for Openai to delete data that was made up from nothing.
But it could probably delete it from the conversation that contained it.

But GDPR also requires that personal data processed should be accurate. Therefore, Openai have to ensure that the data related to a person (personal data) and that is communicated to his browser and stored with the conversation is accurate and Openai must be able to demonstrate which steps it took to ensure this accuracy.

It will be interesting to follow up on this case…

Anonymous Coward says:

Re: Re: Re:2

But GDPR also requires that personal data processed should be accurate. Therefore, Openai have to ensure that the data related to a person (personal data) and that is communicated to his browser and stored with the conversation is accurate and Openai must be able to demonstrate which steps it took to ensure this accuracy.

As someone who has to deal with the GDPR on a daily basis, I’m afraid you are wrong.

The definition of personal data and processing is very specific in the GDPR. Since no personal data is actually stored in an LLM, there is no processing of it either, even though the output from an LLM may look like personal data.

An analogy of what is happening is using statistics from census data (see the sketch at the end of this comment):
* Use the most common surname + first name + average age + profession + city: what is the chance you get something that matches a real person?
* If there is a match, does the processing fall within the GDPR because the “personal data” was created from statistical data by coincidence?

The answer is no, because no personal data was actually ever processed, which is why the GDPR is irrelevant for LLMs.

See also: Infinite Monkey Theorem.
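
A minimal sketch of that census analogy (every "statistic" below is an invented placeholder, not real data): a plausible-looking profile is assembled entirely from aggregates, so no individual's record is ever read or processed, even if the result happens to describe a real person by coincidence.

```python
# Illustrative only: all values below are invented placeholders, not real
# census figures. A plausible-looking "person" is assembled purely from
# aggregate statistics; no individual's record is ever consulted.
most_common = {
    "surname": "Smith",        # hypothetical most common surname
    "first_name": "Maria",     # hypothetical most common first name
    "age": 42,                 # hypothetical average age
    "profession": "teacher",   # hypothetical most common profession
    "city": "Springfield",     # hypothetical largest city
}

profile = (f"{most_common['first_name']} {most_common['surname']}, "
           f"{most_common['age']}, {most_common['profession']} in {most_common['city']}")
print(profile)  # may well match a real person purely by coincidence
```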

Anonymous Coward says:

Re: Re: Re:3

I do not think the GDPR cares how personal data was generated: personal information is ANY information relating to a person who can be identified directly or indirectly.

Also, I do not agree that no personal data is stored within the algorithm. It is there, encoded and filed in a way only the algorithm can find and output.
If no personal data was stored within the algorithm, I believe the probability that it would output mostly correct information about a public figure would be close to zero.

As for the storage, I agree I was wrong when I said “A: the storage of some data is not a GDPR requirement.” as personal data must be part of (or intended to form part of) a filing system.

Anonymous Coward says:

Re:

From the link you provided:

What if the request is manifestly unfounded or excessive?
If requests are manifestly unfounded or excessive, in particular because they are repetitive, you can:

* charge a reasonable fee taking into account the administrative costs of providing the information; or
* refuse to respond.
You have to be able to demonstrate how a request is manifestly unfounded or excessive.

Nice try at mendacity, though.
