Reddit’s ‘AI Scraping’ Lawsuit Is An Attack On The Open Internet

from the this-is-bad-for-the-open-internet dept

When Reddit sued “data scraper” companies and AI firm Perplexity earlier this week, I assumed it was another predictable skirmish over AI training data—the kind of case we’ve been tracking as companies try to wall off the open internet and set up toll booths. But reading the actual complaint made it clear this is something far more dangerous: Reddit isn’t just going after scrapers. It’s mounting a fundamental attack on the very concept of an open internet, using a twisted reading of copyright law that—if it succeeds—would break how search engines, archives, and the web itself operate.

Even if you love Reddit and hate AI, you should be worried about this lawsuit. If it succeeds, it would fundamentally close off most of the open internet.

Most reporting on this skips the nuances, which require a deeper understanding of the law. Fundamentally, Reddit is NOT arguing that these companies are illegally scraping Reddit. It is arguing that they are illegally scraping… Google (which is not a party to the lawsuit) and, in doing so, violating the DMCA’s anti-circumvention clause over content in which Reddit holds no copyright. And then Perplexity is effectively being sued for linking to Reddit.

This is… bonkers on so many levels. And, incredibly, within their lawsuit, Reddit defends its arguments by claiming it’s filing this lawsuit to protect the open internet. It is not. It is doing the exact opposite.

The Background

It is totally reasonable to be concerned about the burden that data scrapers put on websites, and to talk about ways to deal with them. But that’s not what this lawsuit really is. It’s mostly focused on some companies that effectively have built unofficial APIs for getting search results data out of Google. That can be quite useful in some cases! But also, some of the companies in this space can be fairly sketchy. Reddit leans heavily on the sketchiness of the companies to imply “they’re bad.”

But an open web must mean a programmable web of some sort. Building on other services is a fundamental part of the open web and has always been there. If the building becomes abusive, then there are often technical ways of dealing with it. But here, the “abuse” seems to be that Reddit signed a $60 million scraping deal with Google, which was already kinda sketchy.

After all, Reddit has a license to the content users post in order to operate the service, but they don’t hold the copyright on it. Indeed, Reddit’s terms state clearly that users retain “any ownership rights you have in Your content.” Because of Reddit’s agreement that it can license content, the deal with Google could sorta squeeze under that term, but that doesn’t give Reddit the right to then sue over users’ copyrights (as it’s doing in this case).

Either way, there’s an indication that Reddit has gotten greedy. It’s apparently reopened negotiations with Google recently, seeking more money and traffic. But it also wants money from other AI providers. Apparently, that includes Perplexity, which is a pretty useful AI “answer engine” that lets users select from a variety of underlying LLMs. (Perplexity has released its own LLMs, but they were modifications of open source LLMs, including Llama from Meta and Mistral, a popular open source LLM from France. Thus, while Perplexity has offered its own models, it didn’t train them itself.)

Because Perplexity is much more focused on being an alternative to a search engine than a traditional “chat bot,” its focus in answering your questions is to actually provide links as sources for the answers it gives. In effect, it combines a traditional search engine with an LLM and it did this before many other chatbot LLMs added web search capabilities (though most now have them).

But that means, if an “answer” to a question from a user comes from a Reddit post, Perplexity is likely to link to it, just like a regular search engine. But, Reddit wants to get paid. And because Reddit has become so closed and persnickety about things, it looks like Perplexity may have chosen to use these other data scraping firms’ unofficial Google search results APIs to find Reddit posts and link to them.

This is… how the open internet is supposed to work, actually. But Reddit presents it as a sneaky “circumvention.”

Recognizing that Reddit denies scrapers like them access to its site, Defendants SerpApi, Oxylabs, and AWMProxy scrape the data from Google’s search results instead. They do so by masking their identities, hiding their locations, and disguising their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them. For example, during a two-week span in July 2025, Defendants SerpApi, Oxylabs, and AWMProxy circumvented Google’s technological control measures and automatedly accessed, without authorization, almost three billion search engine results pages (“SERPs”) containing Reddit text, URLs, images, and videos.

That’s Not How Circumvention Works

So you might notice something weird in the paragraph above. Namely the claim that the API/scraping companies “circumvented Google’s technological control measures.”

The phrase “Technological control measures” (TCMs) should set off alarm bells for copyright nerds. It’s part of Section 1201 of the DMCA, or the “anti-circumvention” provision. We’ve talked about it for ages, how it’s widely abused, how it threatens innovation, and how it should be abolished.

The fundamental issue is that it says any attempt to “circumvent a technological measure” that tries to protect a copyright-protected work is, itself, copyright infringement. And that’s even if the goal of the circumvention is not even to infringe on the underlying copyright at all. That’s why we’ve seen attempts by companies to use 1201 to, say, block people from using cheaper ink jet cartridges, or getting a cheaper garage door opener. Neither of those sound like copyright issues (because they’re not), but companies tried to abuse 1201 by claiming they put “technological control measures” on those devices, and any “circumvention” should then be seen as infringement.

But here, Reddit is doing something even crazier. Because it’s saying that since these companies (allegedly) get around Google’s technological measures, then somehow Reddit can accuse them of violating 1201.

Reddit and Google have implemented technological measures that effectively control access to Reddit content. Both companies use advanced technological techniques, as described above, to control unauthorized, automated access to their server systems. These measures, in the ordinary course of their operation, limit the freedom and ability of users to access Reddit content, including by prohibiting automated entities from accessing search engine result pages and scraping search engine results that include Reddit content.

Defendants’ actions violate 17 U.S.C. § 1201(a)(1)(A), under which no person shall circumvent a technological measure that effectively controls access to a copyrighted work. Defendants have circumvented these measures in one or more ways, including:

a. Avoiding or bypassing Reddit’s measures entirely in order to obtain Reddit’s content and services, and the content authored by its users, that appear in Google search results; and

b. Avoiding, removing, deactivating, impairing, and/or bypassing SearchGuard and Google’s other technological control measures by using devices, systems, processes, and/or protocols, including large-volume proxy networks, to improperly gain access to Google Search results.

Let’s break this down, because we have to look at how crazy this is.

  1. They’re saying that these companies are “avoiding or bypassing” Reddit’s TCMs. But, the way they’re doing that is by not scraping Reddit. You cannot claim that it is “circumventing a TCM” to get the same content… from Google. That’s crazy.
  2. Even crazier is that they’re arguing that the defendants are circumventing Google’s TCM, even though Google isn’t even a party.
  3. They’re making this claim over content that Reddit holds no copyright over. The copyright remains with the original creator. Reddit holds a license, but a license does not grant Reddit the right to sue over that copyright.

Each one of these ideas is crazy. All three together are ludicrous. Reddit is claiming that these companies violated copyright law by (1) avoiding Reddit and (2) getting the content from publicly available Google searches over (3) content that Reddit has no copyright over.

And somehow that’s supposed to be copyright infringement.

This Is Not Protecting the Open Internet

Even more obnoxiously, Reddit crowns itself a protector of the open internet with this nonsense:

Because Reddit has always believed in the open internet, it takes its role as a steward of its users’ communities, discussions, and authentic human discourse seriously.

Elsewhere in the lawsuit, it says:

As articulated in its Public Content Policy, Reddit believes in an open internet, but it “do[es] not believe that third parties have a right to misuse public content just because it’s public.”

If that’s the case, then… you don’t believe in an open internet. Text and data mining is a part of the open internet. Building on the work of others is part of the open internet. You can’t just claim “we support the open internet, but not if we say you’re misusing it.” It’s not your call.

Yes, there are copyright restrictions on what you can do with others’ content, but (again) Reddit has no copyright interest here. And it can’t even legitimately claim a “circumvention” of a TCM just because these companies got the same data elsewhere.

This Isn’t Even About Training

Some people will still insist this is bad because they hate all AI training based on scraping, but that’s not even what’s happening here. We discussed this a bit in our last piece on cutting off the open internet. It’s one thing to argue that you want to block your content from being trained upon, but it’s a wholly different thing to say “you can’t retrieve this page based on a user search.” That latter scenario is the basis of how search engines exist online, which are fundamental to an open web.

But, as Perplexity notes in its response to the lawsuit (ironically, in the Perplexity subreddit on Reddit), that’s exactly what Reddit is looking to block:

What does Perplexity actually do with Reddit content? We summarize Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time. Perplexity invented citations in AI for two reasons: so that you can verify the accuracy of the AI-generated answers, and so you can follow the citation to learn more and expand your journey of curiosity.

And that’s what people use Perplexity for: journeys of curiosity and learning. When they visit Reddit to read your content it’s because they want to read it, and they read more than they would have from a Google search.

The company also notes that Reddit demanded Perplexity license its data, but Perplexity explained to them (as mentioned above) that they don’t train their own LLM so they don’t need to license data for training.

Here’s where we push back. Reddit told the press we ignored them when they asked about licensing. Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content. Never has. So it is impossible for us to sign a license agreement to do so. 

A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn’t how we do business

For what it’s worth, Perplexity also claims that this is part of Reddit’s plan to “extort” more money from Google.

This is an Anti-Open Internet Lawsuit

If this lawsuit succeeds, it would do enormous damage to the open internet. It would fundamentally make it impossible for search engines to work without licensing all content. It would, in effect, close off huge parts of the open internet to only those with the largest wallets.

Beyond that, it would extend our understanding of Section 1201’s anti-circumvention provisions to absurdity. Saying that not scraping your site is circumvention? Crazy. Saying that (allegedly) “bypassing” someone else’s technological measures lets you sue? Absurd. And saying that you can do all that over content you don’t even hold the copyright on? Preposterously stupid.

If this lawsuit succeeds, it would open up a cottage industry of frivolous lawsuits, while greatly diminishing the nature of the open web.

I’ve long considered Reddit one of the “good” examples of how narrow, more focused, communities can operate. On our latest Ctrl-Alt-Speech, we talked about how it’s one of the examples of the “good” parts of the internet. I know and respect many people at Reddit, including on their legal team.

But I just don’t get this lawsuit. It seems massively destructive to the open internet in what appears to be a very misguided and mis-targeted attempt to shake down extra licensing revenue. There are better ways to do this, and I hope that Reddit reconsiders its approach.

Companies: awmproxy, oxylabs, perplexity, reddit, serpapi


Comments on “Reddit’s ‘AI Scraping’ Lawsuit Is An Attack On The Open Internet”

10 Comments
GHB (profile) says:

Sigh, another 1201 attack against actions on the open web.

Similar to RIAA’s attack on youtube dl: Both involve trying to take down a tool or service on the web of an alleged “DRM” on a service they do not own (from youtube to download videos and from search engine results from google).

The difference: One tries to make an argument that “if a website does not provide a feature, then that user action is prohibited”, which threatens numerous browser extensions that “extend a feature” on the use of a webpage; Reddit’s attempt is to use 1201 to create an assumption that “if a site has anti-bot checks, then it is illegal to use VPNs, alternative frontends, or other competing services to consume content from us outside our service”. They don’t just restrict unauthorized scrapers, but 3rd party scrapers that scrape from the authorized scrapers as well. That’s why they blocked the IA and demanded that it forbid AI scrapers from gathering data from the WBM. And then when such a service does that to google search, they had the audacity to directly sue that 3rd party company for gathering data from the google search results.

That is the same site that has gotten into a controversy over paywalling 3rd party apps on the use of its API (the 2023 r/place was rightfully full of f*ck spez messages), including what is reminiscent of news sites demanding a link tax to show up on your search engine, AI or not (actually this happened before google implemented AI overview, demanding to be paid for snippets and linking).

Arianity (profile) says:

It would fundamentally make it impossible for search engines to work without licensing all content

Search and scraping has worked by falling under fair use, and section (c) provides: Nothing in this section shall affect rights, remedies, limitations, or defenses to copyright infringement, including fair use, under this title.

They’re saying that these companies are “avoiding or bypassing” Reddit’s TCMs. But, the way they’re doing that is by not scraping Reddit. You cannot claim that it is “circumventing a TCM” to get the same content… from Google. That’s crazy.

Why is that crazy? If reddit doesn’t want you to scrape, but you get it from Google (who is allowed to scrape), that doesn’t suddenly make it ok. That’s obviously bullshit. And it would be bad to require Google to have to sue to clean it up (for one, why would Google care at all if you’re misusing Reddit?)

They’re making this claim over content that Reddit holds no copyright over. The copyright remains with the original creator. Reddit holds a license, but a license does not grant Reddit the right to sue over that copyright.

If that were the case, then it seems like Reddit wouldn’t be able to stop any kind of data scraping or similar. But the incentive seems really bad, because it seems likely that Reddit would start asking for copyright for content posted on it. That said, the wording of the section doesn’t seem to be limited to copyright holders. There seems to already be precedent on this (see REALNETWORKS, INC. v. STREAMBOX, Bose Corporation et al v. Zavala ), establishing standing for “anyone injured”.

If that’s the case, then… you don’t believe in an open internet. Text and data mining is a part of the open internet. Building on the work of others is part of the open internet.

If you force people to choose, I think this is going to backfire, and you’re going to make a lot of people decide they don’t believe in an open internet after all.

We already have a problem where search engines are starting to kill content sites because they don’t actually drive click-throughs anymore. You say there are better ways; what are they?

Tanner Andrews (profile) says:

Re: who says

If reddit doesn’t want you to scrape, but you get it from Google (who is allowed to scrape), that doesn’t suddenly make it ok.

If it does not make it OK, then something is seriously wrong. That would allow reddit to control the speech of third parties, such as Google or Bung.

The same rule would apply to newspapers. If the Daily Stormer does not want me to have and share certain information about GOP activities, but the Daily Worker finds out and gives it to me, why should I not have the information from the one who wants to share?

I would need a really convincing reason before I would grant reddit, or the Daily Stormer, control over the speech of other potential speakers.

Anonymous Coward says:

No, circumvention is not infringement

So no, Section 1202 does not say, or mean, that “any attempt to ‘circumvent a technological measure’ that tries to protect a copyright-protected work is, itself, copyright infringement.” RFR – Read the Rule. What someone does with the work once they’ve circumvented the measure … that may or may not be copyright infringement. Reasonable minds can disagree about whether Reddit should be entitled to protect itself, or “protect” itself if you prefer–I have thoughts–but let’s try and stay on the rails.
