Stories filed under: "scraping"

The Free And Open Web Is Under Attack At The IETF

from the the-open-web-includes-the-ability-to-scrape dept

Thu, Jun 25th 2026 03:06pm - Tori Noble

The ability to access publicly available information using automated tools is a central value and benefit of a free and open internet. Automated access—often called crawling or scraping—powers important, useful tools for locating, preserving, and analyzing online information. For example, crawling and scraping help journalists, researchers, and watchdog organizations report the news, find security flaws, and investigate discrimination. Crawling the web allows non-profits like the Internet Archive to preserve historical copies of websites. Tools for automated comparison shopping allow consumers to find the best deals on items they want to buy. And so on.

Yet open internet access is increasingly under threat from publishers and Big Tech companies alike. Fearing lost advertising and licensing revenues, website operators increasingly claim that they need to lock down their sites from bots that crawl public web content to train or operate AI models. Some companies are even trying to embed their business models into internet standards by changing Internet Engineering Task Force (IETF) technical standards that shape much of the internet.

Many of their economic anxieties are understandable. AI bots can strain websites’ infrastructure, in some cases degrading site performance or taking sites offline altogether. Upgrading systems costs money that some sites may not have. And AI is likely to disrupt the business models many publishers adopted in response to the rise of the internet, if users rely on AI overviews instead of visiting source websites.

However reasonable these fears may be, the answer is not to change the IETF standards from neutral protocols that encourage openness to restrictive requirements designed to monetize internet access.

The worst of these proposed standards would give websites far greater ability to automatically block legitimate, lawful scraping and crawling. For example, the AI Preferences working group is working on proposals to give publishers a way to express “preference signals” against crawling web data for AI-related purposes, including to train models, generate outputs, and help users search the web. These preference signals would be expressed through robots.txt and could potentially become legally binding in some jurisdictions.

Another working group, called Web Bot Auth, is pursuing efforts to protect sites from overly-aggressive bots that strain website resources—a positive goal that could meaningfully improve the internet in the AI era. But Web Bot Auth is simultaneously pursuing a much more dangerous path as well: standards changes that would enable sites to cryptographically identify bots so that they can more easily block anyone they wish—not just “bad” actors, but competitors, dissidents, or anyone who hasn’t paid for the right to access sites using automated tools. If sites restrict crawling to a preapproved list of cryptographically authenticated bots, they could require licensing payments from those wishing to crawl their sites. This would close off the open web to researchers, archivists, and startups without the ability to pay for automated access.

Websites may have legitimate reasons to worry about AI’s impacts on their traffic and advertising revenue, but those reasons must be weighed against the benefits of the open web. These proposals would effectively give website operators veto power over a wide range of important uses—from the investigations and archival works described above to accessibility tools for people with disabilities, to research efforts aimed at holding governments accountable.

That is why we are fighting back against these threats to open access. EFF and our allies in the open internet community have successfully resisted some of the most dangerous IETF proposals thus far—and won’t stop working to protect the open web from efforts to manipulate internet standards to undermine the right to freely access the internet in any legal way, including with automated tools.

Republished from the EFF’s Deeplinks blog.

Filed Under: ai, open access, open web, scraping
Companies: ietf


Wikipedia Grapples With New Challenges From AI

Culture

from the the-knowledge-base-powering-ai dept

Thu, Feb 19th 2026 01:30pm - Glyn Moody

Wikipedia celebrated its 25th birthday last month. Given the centrality of Wikipedia to so much activity online, it is hard to remember (or to imagine, for those who are younger) a time without Wikipedia. The latest statistics are impressive:

Wikipedia is viewed nearly 15 billion times every month.
Wikipedia contains over 65 million articles across more than 300 languages.
Wikipedia is edited by nearly 250,000 editors every month around the world. Editors are defined by one edit or more every month; only editors with a username are counted.
Wikipedia is accessed by over 1.5 billion unique devices every month.

That’s testimony to the global nature of Wikipedia. But there’s something else, not mentioned there, that is of great relevance to this blog: the fact that every one of those 65 million articles is made available under a generous license – the Creative Commons Attribution-ShareAlike 4.0 license, to be precise. That means sharing and re-use are encouraged, in contrast to most material online, where copyright is fiercely enforced. Wikipedia is living proof that giving away things by relying on volunteers and donations – the “true fans” approach – works, and on a massive scale. Anil Dash puts it well in a post celebrating Wikipedia’s 25th anniversary:

Whenever I worry about where the Internet is headed, I remember that this example of the collective generosity and goodness of people still exists. There are so many folks just working away, every day, to make something good and valuable for strangers out there, simply from the goodness of their hearts. They have no way of ever knowing who they’ve helped. But they believe in the simple power of doing a little bit of good using some of the most basic technologies of the internet. Twenty-five years later, all of the evidence has shown that they really have changed the world.

However, Wikipedia is today facing perhaps its greatest challenge, which comes from the new generation of AI services. They are problematic for Wikipedia in two main ways. The first, ironically, is because it is widely recognized that Wikipedia’s holdings represent some of the highest-quality training materials available. In a post explaining why, “in the AI era, Wikipedia has never been more valuable”, the Wikimedia Foundation writes:

AI cannot exist without the human effort that goes into building open and nonprofit information sources like Wikipedia. That’s why Wikipedia is one of the highest-quality datasets in the world for training AI, and when AI developers try to omit it, the resulting answers are significantly less accurate, less diverse, and less verifiable.

That recognition is welcome, but comes at a price. It means that every AI company as a matter of course wants to download the entire Wikipedia corpus to be used for training its models. That has led to irresponsible behavior by some companies, when their scraping tools download pages from Wikipedia with no consideration for the resources they are using for free, or the collateral damage they are causing to other users in terms of slower responses.

Trying to stop companies drawing on this unique resource is futile; recognizing this, the Wikimedia Foundation has come up with an alternative approach: Wikimedia Enterprise, “a first-of-its-kind commercial product designed for companies that reuse and source Wikipedia and Wikimedia projects at a high volume”. In 2022, its first customers were Google and the Internet Archive, and last month, Wikimedia Enterprise announced that Amazon, Meta, Microsoft, Mistral AI, and Perplexity have also signed. That’s important for a couple of reasons. It means that many of the biggest AI players will download Wikipedia articles more efficiently. It also means that the Wikipedia project will receive funding for its work.

This new money is crucial if Wikipedia is to remain a high quality resource. And that is precisely why every generative AI company that uses Wikipedia posts for training should – if only out of self-interest – pay to do so. What is happening here echoes something this blog suggested back in May 2024: that AI companies should pay artists to create new works, and give away the results, because fresh training material is vital. Helping to pay for Wikipedia to create more high-quality articles that are freely available to all is a variation on that theme.

The other problem that generative AI causes Wikipedia is more subtle. The Wikimedia Foundation explains that alongside financial support, the project needs proper attribution:

Attribution means that generative AI gives credit to the human contributions that it uses to create its outputs. This maintains a virtuous cycle that continues those human contributions that create the training data that these new technologies rely on. For people to trust information shared on the internet, platforms should make it clear where the information is sourced from and elevate opportunities to visit and participate in those sources. With fewer visits to Wikipedia, fewer volunteers may grow and enrich the content, and fewer individual donors may support this work.

Without fresh volunteers, Wikipedia will wither and become less valuable. That’s terrible for the world, but it is also bad for generative AI companies. So, again, it makes sense for them to provide proper attribution in their outputs. That requirement has become even more pressing in the light of a new development. According to tests carried out by the Guardian:

The latest model of ChatGPT has begun to cite Elon Musk’s Grokipedia as a source on a wide range of queries, including on Iranian conglomerates and Holocaust deniers, raising concerns about misinformation on the platform.

That’s potentially problematic because of how Grokipedia creates its entries. Research last year found that:

Grokipedia articles are substantially longer and contain significantly fewer references per word. Moreover, Grokipedia’s content divides into two distinct groups: one that remains semantically and stylistically aligned with Wikipedia, and another that diverges sharply. Among the dissimilar articles, we observe a systematic rightward shift in the political bias of cited news sources, concentrated primarily in entries related to politics, history, and religion. These findings suggest that AI-generated encyclopedic content diverges from established editorial norms-favouring narrative expansion over citation-based verification.

If leading chatbots start drawing on Grokipedia routinely for their answers, it is less likely that there are independent sources where the information can be checked, something generally possible with Wikipedia. It therefore becomes even more urgent for generative AI systems to provide attribution, so at least users know where information is coming from, and whether there are likely to be further resources that confirm a chatbot’s claims. Not everyone will want to do that, but it is important to offer it as an option.

Wikipedia at 25 is an amazing achievement in multiple ways, one of which includes serving as a demonstration that material can be given away for free, supported directly by users, and on a global scale. It would be a tragedy if the current enthusiasm for generative AI systems led to that resource being harmed and even destroyed. A world without Wikipedia would be a poorer world indeed.

Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.

Filed Under: ai, attribution, grokipedia, scraping, wikipedia
Companies: amazon, google, internet archive, meta, microsoft, mistral, perplexity, wikipedia


Preserving The Web Is Not The Problem. Losing It Is.

Culture

from the libraries-matter dept

Tue, Feb 17th 2026 03:24pm - Mark Graham

Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. As stated in the article, these organizations are blocking access largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.

These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.

The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository library, has been building its archive of the world wide web since 1996. Today, the Wayback Machine provides access to thirty years’ worth of web history and culture. It has become an essential resource for journalists, researchers, courts, and the public.

For three decades the Wayback Machine has peacefully coexisted with the development of the web, including the websites mentioned in the article. Our mission is simple: to preserve knowledge and make it accessible for research, accountability, and historical understanding.

As tech policy writer Mike Masnick recently warned, blocking preservation efforts risks a profound unintended consequence: “significant chunks of our journalistic record and historical cultural context simply… disappear.” He notes that when trusted publications are absent from archives, we risk creating a historical record biased against quality journalism.

There is no question that generative AI has changed the landscape of the world wide web. But it is important to be clear about what the Wayback Machine is, and what it is not.

The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.

We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.

What concerns me most is the unintended consequence of these blocks. When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.

Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.

Mark Graham is the Director of the Wayback Machine at the Internet Archive

Filed Under: ai, archives, journalism, libraries, preserving history, scraping, wayback machine
Companies: internet archive


News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

Culture

from the our-digital-history dept

Fri, Feb 13th 2026 11:57am - Mike Masnick

Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.

This is a mistake we’re going to regret for generations.

Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.

The Times has gone even further:

The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.

“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”

I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.

But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.

And that’s bad.

The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.

In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.

And now we’re telling them they can’t preserve the work of our most trusted publications.

Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.

Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.

As one computer scientist quoted in the Nieman piece put it:

“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”

That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.

The most frustrating bit of all of this: The Guardian admits they haven’t actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They’re breaking historical preservation based on a hypothetical threat:

The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.

And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).

Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.

This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.

The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:

In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.

Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.

Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
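For readers who have not looked inside a robots.txt file, the rules described in that excerpt can be sketched in a few lines. This is a minimal illustration using Python's standard `urllib.robotparser`; the file contents, the `news.example` URL, and the `ExampleBrowser` agent name are assumptions modeled on the description above, not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the Nieman Lab description:
# two Internet Archive bots are disallowed, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    url = "https://news.example/2025/story.html"  # placeholder URL
    for agent in ("archive.org_bot", "ia_archiver-web.archive.org", "ExampleBrowser"):
        print(agent, is_allowed(parser, agent, url))
```

Note that robots.txt is only a request that well-behaved crawlers choose to honor. The "hard blocking" and Wayback Machine exclusions mentioned above are separate, stronger measures.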

A Gannett spokesperson told Nieman Lab that it was about “safeguarding our intellectual property,” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials that were protected by copyright law. The claim that they have to be blocked to safeguard such content is both technologically and historically illiterate.

And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.

The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:

“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”

But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.

What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.

Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?

This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.

I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.

We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.

Filed Under: ai, archives, culture, libraries, scanning, scraping
Companies: internet archive, ny times, the guardian, usa today


Reddit’s ‘AI Scraping’ Lawsuit Is An Attack On The Open Internet

from the this-is-bad-for-the-open-internet dept

Fri, Oct 24th 2025 10:48am - Mike Masnick

When Reddit sued “data scraper” companies and AI firm Perplexity earlier this week, I assumed it was another predictable skirmish over AI training data—the kind of case we’ve been tracking as companies try to wall off the open internet and set up toll booths. But reading the actual complaint made it clear this is something far more dangerous: Reddit isn’t just going after scrapers. It’s mounting a fundamental attack on the very concept of an open internet, using a twisted reading of copyright law that—if it succeeds—would break how search engines, archives, and the web itself operate.

Even if you love Reddit and hate AI, you should be worried about this lawsuit. If it succeeds, it would fundamentally close off most of the open internet.

Most reporting on this has not explained the nuances, which require a deeper understanding of the law. Fundamentally, Reddit is NOT arguing that these companies are illegally scraping Reddit, but rather that they are illegally scraping… Google (which is not a party to the lawsuit), and in doing so violating the DMCA’s anti-circumvention clause over content to which Reddit holds no copyright. And, then, Perplexity is effectively being sued for linking to Reddit.

This is… bonkers on so many levels. And, incredibly, within their lawsuit, Reddit defends its arguments by claiming it’s filing this lawsuit to protect the open internet. It is not. It is doing the exact opposite.

The Background

It is totally reasonable to be concerned about the burden that data scrapers put on websites, and to talk about ways to deal with them. But that’s not what this lawsuit really is. It’s mostly focused on some companies that effectively have built unofficial APIs for getting search results data out of Google. That can be quite useful in some cases! But also, some of the companies in this space can be fairly sketchy. Reddit leans heavily on the sketchiness of the companies to imply “they’re bad.”

But, an open web must mean a programmable web of some sort. Building on other services is a fundamental part of the open web and has always been there. If the building becomes abusive, then there are often technical ways of dealing with it. But here, the “abuse” seems to be that Reddit signed a $60 million scraping deal with Google, which was already kinda sketchy.

After all, Reddit has a license to the content users post in order to operate the service, but they don’t hold the copyright on it. Indeed, Reddit’s terms state clearly that users retain “any ownership rights you have in Your content.” Because of Reddit’s agreement that it can license content, the deal with Google could sorta squeeze under that term, but that doesn’t give Reddit the right to then sue over users’ copyrights (as it’s doing in this case).

Either way, there’s an indication that Reddit has gotten greedy. It’s apparently reopened negotiations with Google recently, seeking more money and traffic. But it also wants money from other AI providers. Apparently, that includes Perplexity, which is a pretty useful AI “answer engine” that lets users select from a variety of underlying LLMs. (Perplexity has released its own LLMs, but they were modifications of open source LLMs, including Llama from Meta and Mistral, a popular open source LLM from France. Thus, while Perplexity has offered its own models, it didn’t train them itself.)

Because Perplexity is much more focused on being an alternative to a search engine than a traditional “chat bot,” its focus in answering your questions is to actually provide links as sources for the answers it gives. In effect, it combines a traditional search engine with an LLM and it did this before many other chatbot LLMs added web search capabilities (though most now have them).

But that means, if an “answer” to a question from a user comes from a Reddit post, Perplexity is likely to link to it, just like a regular search engine. But, Reddit wants to get paid. And because Reddit has become so closed and persnickety about things, it looks like Perplexity may have chosen to use these other data scraping firms’ unofficial Google search results APIs to find Reddit posts and link to them.

This is… how the open internet is supposed to work, actually. But Reddit presents it as a sneaky “circumvention.”

Recognizing that Reddit denies scrapers like them access to its site, Defendants SerpApi, Oxylabs, and AWMProxy scrape the data from Google’s search results instead. They do so by masking their identities, hiding their locations, and disguising their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them. For example, during a two-week span in July 2025, Defendants SerpApi, Oxylabs, and AWMProxy circumvented Google’s technological control measures and automatedly accessed, without authorization, almost three billion search engine results pages (“SERPs”) containing Reddit text, URLs, images, and videos.

That’s Not How Circumvention Works

So you might notice something weird in the paragraph above. Namely the claim that the API/scraping companies “circumvented Google’s technological control measures.”

The phrase “Technological control measures” (TCMs) should set off alarm bells for copyright nerds. It’s part of Section 1201 of the DMCA, or the “anti-circumvention” provision. We’ve talked about it for ages, how it’s widely abused, how it threatens innovation, and how it should be abolished.

The fundamental issue is that it says any attempt to “circumvent a technological measure” that tries to protect a copyright-protected work is, itself, copyright infringement. And that’s even if the goal of the circumvention is not even to infringe on the underlying copyright at all. That’s why we’ve seen attempts by companies to use 1201 to, say, block people from using cheaper ink jet cartridges, or getting a cheaper garage door opener. Neither of those sounds like a copyright issue (because they’re not), but companies tried to abuse 1201 by claiming they put “technological control measures” on those devices, and any “circumvention” should then be seen as infringement.

But here, Reddit is doing something even crazier. Because it’s saying that since these companies (allegedly) get around Google’s technological measures, then somehow Reddit can accuse them of violating 1201.

Reddit and Google have implemented technological measures that effectively control access to Reddit content. Both companies use advanced technological techniques, as described above, to control unauthorized, automated access to their server systems. These measures, in the ordinary course of their operation, limit the freedom and ability of users to access Reddit content, including by prohibiting automated entities from accessing search engine result pages and scraping search engine results that include Reddit content.

Defendants’ actions violate 17 U.S.C. § 1201(a)(1)(A), under which no person shall circumvent a technological measure that effectively controls access to a copyrighted work. Defendants have circumvented these measures in one or more ways, including:

a. Avoiding or bypassing Reddit’s measures entirely in order to obtain Reddit’s content and services, and the content authored by its users, that appear in Google search results; and

b. Avoiding, removing, deactivating, impairing, and/or bypassing SearchGuard and Google’s other technological control measures by using devices, systems, processes, and/or protocols, including large-volume proxy networks, to improperly gain access to Google Search results.

Let’s break this down, because we have to look at how crazy this is.

1. They’re saying that these companies are “avoiding or bypassing” Reddit’s TCMs. But, the way they’re doing that is by not scraping Reddit. You cannot claim that it is “circumventing a TCM” to get the same content… from Google. That’s crazy.
2. Even crazier is that they’re arguing that the defendants are circumventing Google’s TCM, even though Google isn’t even a party.
3. They’re making this claim over content that Reddit holds no copyright over. The copyright remains with the original creator. Reddit holds a license, but a license does not grant Reddit the right to sue over that copyright.

Each one of these ideas is crazy. All three of them together is ludicrous. Reddit is claiming that these companies violated copyright law by (1) avoiding Reddit and (2) getting the content from publicly available Google searches over (3) content that Reddit has no copyright over.

And somehow that’s supposed to be copyright infringement.

This Is Not Protecting the Open Internet

Even more obnoxiously, Reddit crowns itself a protector of the open internet with this nonsense:

Because Reddit has always believed in the open internet, it takes its role as a steward of its users’ communities, discussions, and authentic human discourse seriously.

Elsewhere in the lawsuit, it says:

As articulated in its Public Content Policy, Reddit believes in an open internet, but it “do[es] not believe that third parties have a right to misuse public content just because it’s public.”

If that’s the case, then… you don’t believe in an open internet. Text and data mining is a part of the open internet. Building on the work of others is part of the open internet. You can’t just claim “we support the open internet, but not if we say you’re misusing it.” It’s not your call.

Yes, there are copyright restrictions on what you can do with others’ content, but (again) Reddit has no copyright interest here. And it can’t even legitimately claim a “circumvention” of a TCM just because these companies got the same data elsewhere.

This Isn’t Even About Training

Some people will still insist this is bad because they hate all AI training based on scraping, but that’s not even what’s happening here. We discussed this a bit in our last piece on cutting off the open internet. It’s one thing to argue that you want to block your content from being trained upon, but it’s a wholly different thing to say “you can’t retrieve this page based on a user search.” That latter scenario is the basis of how search engines exist online, which are fundamental to an open web.

But, as Perplexity notes in its response to the lawsuit (ironically, in the Perplexity subreddit on Reddit), that’s exactly what Reddit is looking to block:

What does Perplexity actually do with Reddit content? We summarize Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time. Perplexity invented citations in AI for two reasons: so that you can verify the accuracy of the AI-generated answers, and so you can follow the citation to learn more and expand your journey of curiosity.

And that’s what people use Perplexity for: journeys of curiosity and learning. When they visit Reddit to read your content it’s because they want to read it, and they read more than they would have from a Google search.

The company also notes that Reddit demanded Perplexity license its data, but Perplexity explained to them (as mentioned above) that they don’t train their own LLM so they don’t need to license data for training.

Here’s where we push back. Reddit told the press we ignored them when they asked about licensing. Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content. Never has. So it is impossible for us to sign a license agreement to do so.

A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn’t how we do business.

For what it’s worth, Perplexity also claims that this is part of Reddit’s plan to “extort” more money from Google.

This is an Anti-Open Internet Lawsuit

If this lawsuit succeeds, it would signal a huge destruction of the open internet. It would fundamentally make it impossible for search engines to work without licensing all content. It would, in effect, close off huge parts of the open internet to only those with the largest wallets.

Beyond that, it would extend our understanding of Section 1201’s anti-circumvention provisions to absurdity. Saying that not scraping your site is circumvention? Crazy. Saying that (allegedly) “bypassing” someone else’s technological measures lets you sue? Absurd. And saying that you can do all that over content you don’t even hold the copyright on? Preposterously stupid.

If this lawsuit succeeds, it would open up a cottage industry of frivolous lawsuits, while greatly diminishing the nature of the open web.

I’ve long considered Reddit one of the “good” examples of how narrow, more focused, communities can operate. On our latest Ctrl-Alt-Speech, we talked about how it’s one of the examples of the “good” parts of the internet. I know and respect many people at Reddit, including on their legal team.

But I just don’t get this lawsuit. It seems massively destructive to the open internet in what appears to be a very misguided and mis-targeted attempt to shake down extra licensing revenue. There are better ways to do this, and I hope that Reddit reconsiders its approach.

Filed Under: ai, anti-circumvention, circumvention, copyright, dmca 1201, generative ai, licensing, open internet, scraping
Companies: awmproxy, oxylabs, perplexity, reddit, serpapi

We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else

Predictions

from the how-open-is-it? dept

Mon, Sep 8th 2025 10:43am - Mike Masnick

A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what he was seeing. Across the tech policy world, people who spent decades fighting for an open, accessible internet are now cheering as that same internet gets locked down, walled off, and restricted. Their reasoning? If it hurts AI companies, it must be good.

This is a profound mistake that threatens the very principles these advocates once championed.

There are plenty of reasons to be concerned about LLM/AI tools these days, in terms of how they can be overhyped, how they can be misused, and certainly over who has power and control over the systems. But it’s deeply concerning to me how many people who supported an open internet and the fundamental principles that underlie that have now given up on those principles because they see that some AI companies might benefit from an open internet.

The problem isn’t just ideological—it’s practical. We’re watching the construction of a fundamentally different internet, one where access is controlled by gatekeepers and paywalls rather than governed by open protocols and user choice. And we’re doing it in the name of stopping AI companies, even though the real result will be to concentrate even more power in the hands of those same large tech companies while making the internet less useful for everyone else.

The move toward a closed internet shifted into high gear, to some extent, with Cloudflare launching its pay-per-crawl feature. I will admit that when I first saw this announcement, it intrigued me. It would sure be nice for Techdirt if we suddenly started getting random checks from AI companies for crawling the more than 80k articles we’ve written that are then fueling their LLMs.

But, also, I recognize that even having 80k high-quality (if I say so myself) articles is probably worth… not very much. LLMs are based on feeding billions of pieces of content—articles, websites, comments, pdfs, videos, books, etc—into a transformer tool to make the LLMs work. Any individual piece of content (or even 80k pieces of content) is actually not worth that much. So, even if Cloudflare’s system got anyone to pay, the net effect for almost everyone online would be… tiny.

Of course, history has also shown that those setting up the tollbooths to be aggregators of such payments often do quite well. So I’m sure Cloudflare might do quite well out of this deal (and, honestly, I would trust Cloudflare to do a better job of this than many other companies, given its history). But the tollbooth/aggregators quite often become corrupt. Research on the history of these kinds of “collective licensing” intermediaries shows a long trail of corruption and other problems.

None of this is to suggest Cloudflare will definitely go down the road of corruption, but the temptations will be there. More concerning than the economic model, though, was what came next: a secondary announcement from Cloudflare that revealed a fundamental confusion about what kinds of internet access should be restricted. Last month, it accused AI company Perplexity of “using stealth, undeclared crawlers to evade website no-crawl directives.”

Plenty of people reacted angrily to the story, arguing it was proof of bad behavior on Perplexity’s part, but the details suggest that Cloudflare was conflating very different activities. It’s one thing to block scraper bots that are building up an index of content for training an LLM. That’s an area where it seems reasonable for some to choose to block those bots.

But what Cloudflare described was something different entirely:

We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website….

We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

This is where the anti-AI sentiment becomes genuinely dangerous to internet openness. It’s one thing to say “no general scraping bots” but what Cloudflare is describing here is something much more fundamental: they want robots.txt files to control not just automated crawling, but individual user queries. That’s not protecting against bulk AI training—that’s breaking how the web works.
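For readers who have never looked at one, here is a minimal sketch of what a "stop all respectful bots" robots.txt can look like and how a well-behaved client would consult it, using Python's standard `urllib.robotparser`. The passage above does not reproduce the file Cloudflare actually used, so the two-line file, the `ExampleBot` name, and the URL below are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Assumed example of a "no bots anywhere" file; not Cloudflare's actual file.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

# An empty Disallow value means "nothing is disallowed".
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://testexample.com/some-page"
    print(may_fetch(DISALLOW_ALL, "ExampleBot", page))
    print(may_fetch(ALLOW_ALL, "ExampleBot", page))
```

Note what kind of check this is: something the client chooses to run before it fetches a page. Nothing in the file itself distinguishes a bulk crawl from a single page fetched because a user asked for it, which is the distinction at issue here.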

Let me give an example that hopefully clarifies why I find this problematic. A year and a half ago, I wrote about how I use LLM tools at Techdirt to help me with editing. A lot has changed in the 17 months since that was written, but I still use the same tool, Lex, to help me with editing what I write. And one thing I’ve found to be super useful in my final edit is that I give the tool a list of all sources I used in writing the article so that it can fact check (it will also search other sources for me, which is quite useful as it will—with surprising frequency—find useful sources to add more relevant information to an article).

But, increasingly, I’m finding that for certain news sites, it refuses to read them, and I’m guessing it’s because of various lawsuits some publishers have filed. So, for example, I find that the tool I use refuses to read NY Times or NBC News stories. But, I’m not trying to train an AI on those articles. I’m just asking it to read over the article, read over what I’ve written, and give me a sense of whether or not it believes I’m writing a fair assessment based on those articles.

When the AI is able to read that content, I find it incredibly useful in making sure that my reporting is accurate and clear. But there are times I’m unable to do that, because these publishers have taken such an extreme view of these tools that they seek to block any and all access.

This illustrates the core problem: we’re not just blocking bulk AI training anymore. We’re blocking legitimate individual use of AI tools to access and analyze web content. That’s not protecting creator rights—that’s breaking the fundamental promise of the web that if you publish something publicly, people should be able to access and use it.

Consider the broader implications: if we normalize blocking AI tools from accessing web content, where does it end? We’ve talked in the past about how many visually impaired users rely on technological tools to “read” websites for them. If we establish that all technological intermediary tools can be blocked without payment, we’re not just hurting AI companies—we’re potentially breaking accessibility tools that people depend on.

There’s a world of difference between “scrape this site to add it to a massive corpus of data” and “hey, can you just look at this one site to see what it says?” One is a big scraping job and one is simply a user-directed prompt.

Cloudflare’s complaint against Perplexity seems to conflate the two and pretend they’re the same. And I wasn’t the only one who noticed how odd this is, especially if you believe in an open web. On an open web, if I point a browsing tool at an open website, the tool should be able to read that website.

The collateral damage from this conflation is already spreading beyond AI companies.

Take, for example, Reddit telling the Internet Archive that it was going to start blocking its crawler from archiving Reddit feeds, because it was worried that AI companies were simply getting access to its content (that Reddit now is looking to license) by going to the Wayback Machine instead.

Here we see the real economic driver behind much of this: Reddit has discovered that user-generated content can be a revenue stream through AI licensing deals. But rather than finding ways to capture that value while preserving archival access, they’re choosing to break historical preservation entirely. We’re losing decades of human discourse and cultural history because Reddit wants to ensure AI companies pay for access to fresh content.

All of this suggests we’re moving very far away from an open internet, and towards one where it’s not just “pay to crawl” but it’s “pay to click” to get access to anything online.

Common Crawl, a non-profit at the center of some of these fights, is finding itself in a tough spot as well. It’s spent many years creating incredibly important and useful archives of the web. Those archives have been essential for many important research projects. But the Common Crawl archives have also been quite useful to LLM companies, and Common Crawl has been trying to navigate all of this. Unlike some others, its scanning bot is quite clear about who it is and seeks to be as “friendly” as a scraping bot can be. It’s not trying to sneak around, yet it’s suddenly facing challenges where it can’t accurately archive large parts of the web any more.

The Common Crawl situation perfectly illustrates how anti-AI sentiment is destroying valuable public resources. Common Crawl has been crucial for academic research, journalism, and public interest projects for over a decade. Researchers have used its archives to study everything from the spread of misinformation to the evolution of web technologies. But because AI companies also found the archives useful, Common Crawl is now being shut out of large parts of the web.

This is the definition of cutting off your nose to spite your face. We’re destroying a public good that benefits researchers, journalists, and civil society because we’re afraid that AI companies might also benefit from it.

And all that means that the web isn’t that open anymore. And that’s sad to think about.

Common Crawl is now suggesting that more forward-thinking companies will start treating open crawling of their websites as an updated form of “search engine optimization,” or, in this case, AI optimization. At least some companies seem to agree, as managers want information about their companies, or links to them, to appear in AI results as more searches go to LLMs instead of traditional search queries:

A significant number of websites currently block CCBot (Common Crawl’s web crawler), often without realizing its role in the ML and research ecosystems. Common Crawl publishes monthly web datasets which serve as foundational training data for major AI models and research initiatives.

As one SEO Ash Nallawalla (Author of The Accidental SEO Manager) wrote:

“A manager asked me why our leading brand was not mentioned by an AI platform, which mentioned obscure competitors instead. I found that we had been blocking ccBot for some years, because some sites were scraping our content indirectly. After some discussion, we felt that allowing LLM crawlers was more beneficial than the risk of being scraped, so we revised our exclusion list.”

If CCBot can’t crawl your site, your content is absent from one of the key datasets on which AI models are trained, potentially making your brand less visible in AI-powered search results.
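A site owner who wants to check whether an old exclusion list is still shutting out CCBot could run something like the sketch below. Only the `CCBot` name comes from the quote above; the robots.txt contents, the `ExampleSearchBot` name, and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion list: CCBot is blocked everywhere,
# every other crawler is only kept out of /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the subset of agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    crawlers = ["CCBot", "ExampleSearchBot"]
    print(blocked_agents(EXAMPLE_ROBOTS_TXT, crawlers, "https://example.com/articles/1"))
    print(blocked_agents(EXAMPLE_ROBOTS_TXT, crawlers, "https://example.com/private/x"))
```

This only reports what the file asks for. Whether a given crawler honors it is a separate question.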

This quote reveals the fundamental tension in the current approach. Companies are discovering that blocking AI access doesn’t just prevent training—it makes them invisible in an increasingly AI-mediated web. As Judge Mehta just noted in the Google antitrust remedies ruling, AI is beginning to encroach on the historical search market. As more people use AI tools for search and research, being blocked from AI training datasets means being blocked from discoverability.

We’re creating a two-tier internet: sites that can be found and accessed through modern tools, and sites that can’t. Guess which tier will thrive?

In other words, there is a lot going on across the board here. You have some companies who want to appear in AI results. You have some (including us at Techdirt!) who don’t mind it when AI scanners crawl and learn from our content, so long as they don’t take down our servers.

But, increasingly, we’re seeing people have such a negative, knee-jerk, anti-AI stance that they may be shutting off access to the web in a manner that could lead to the death of an open web, and could lead much more towards a pay-to-access model on the web, which I think is a result that most of us would regret.

And this is what I fear we’re going to end up with: an internet where large platforms control access through licensing deals and technical restrictions, where public archives are neutered to prevent AI companies from accessing them, and where individual users can’t use modern tools to access and analyze web content. It’s a world where Google, Microsoft, and Meta get special access through billion-dollar licensing deals while everyone else—researchers, journalists, small businesses, individual users—gets locked out.

The power and excitement of an open web was that it was open and accessible to all. The web’s core principle wasn’t “open to everyone except the technologies we don’t like.” It was “open, period.” Once we start making exceptions based on who might benefit or what technology might be used to access content, we’ve abandoned that principle entirely.

We’re not protecting creators or preserving the open internet—we’re helping to destroy it. The real winners in this new world won’t be individual writers or small publishers. They’ll be the same large tech companies that can afford licensing deals and that have the resources to navigate an increasingly complex web of access restrictions. The losers will be everyone else: users, researchers, archivists, and the long tail of creators who benefit from an open, discoverable web.

None of this means we should ignore legitimate concerns about AI training or creator compensation. But we should address those concerns through mechanisms that preserve internet openness rather than destroy it. That might mean new business models, better attribution systems, or novel approaches to creator compensation. What it shouldn’t mean is abandoning the fundamental architecture of the web.

And that would be unfortunate for all of us.

Filed Under: ai, bots, crawling, intermediaries, open internet, scraping
Companies: cloudflare, common crawl, internet archive, reddit

AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk

(Mis)Uses of Technology

from the externalizing-your-costs-directly-into-my-face dept

Thu, Apr 10th 2025 01:02pm - Glyn Moody

The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.

Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.
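One rough way to read those two percentages together is sketched below. It assumes both figures describe the same pool of traffic and that "resource-consuming traffic" can be compared against pageviews directly, which the passage does not spell out, so treat the result as a back-of-the-envelope illustration rather than a Wikimedia figure. The function name is hypothetical.

```python
def relative_cost_per_pageview(bot_share_of_expensive: float,
                               bot_share_of_pageviews: float) -> float:
    """How many times more expensive traffic a bot pageview accounts for
    than a human pageview, under the assumptions stated above."""
    human_share_of_expensive = 1 - bot_share_of_expensive
    human_share_of_pageviews = 1 - bot_share_of_pageviews
    bot_cost = bot_share_of_expensive / bot_share_of_pageviews
    human_cost = human_share_of_expensive / human_share_of_pageviews
    return bot_cost / human_cost


if __name__ == "__main__":
    # 65% of the expensive traffic, 35% of the pageviews (figures from the text).
    ratio = relative_cost_per_pageview(0.65, 0.35)
    print(f"{ratio:.2f}x")
```

Under those assumptions, the average bot pageview accounts for roughly three and a half times as much of the expensive traffic as the average human pageview.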

These are the open source projects that have a Web presence with a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew DeVault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:

These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

DeVault says that he knows many other Web sites are similarly affected:

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.

The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring and fine-tuning them requires time and energy from those running the sites, time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.
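The traffic pattern described in the SourceHut post quoted above suggests why one of the simplest defenses, a per-IP request cap, does little against this kind of crawler. The sketch below is a toy illustration, not any site's actual configuration: the limiter, the cap of five requests, and the addresses are all invented.

```python
from collections import Counter


def blocked_requests(request_ips: list[str], max_per_ip: int) -> int:
    """Count the requests a naive per-IP cap would reject."""
    seen = Counter()
    blocked = 0
    for ip in request_ips:
        seen[ip] += 1
        if seen[ip] > max_per_ip:
            blocked += 1
    return blocked


# 10,000 requests from one address: almost all are rejected.
single_source = ["203.0.113.7"] * 10_000

# 10,000 requests spread over 10,000 distinct (made-up) addresses,
# one request each: the cap never triggers.
distributed = [
    f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}" for i in range(10_000)
]

if __name__ == "__main__":
    print(blocked_requests(single_source, max_per_ip=5))
    print(blocked_requests(distributed, max_per_ip=5))
```

The server sees the same total load in both cases, but in the second one no individual address ever looks abusive.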

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:

Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.

It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.

Follow me @glynmoody on Mastodon and on Bluesky.

Filed Under: access to knowledge, ai, apis, bandwidth, bots, datacenter, drew devault, licensing, llms, open source, open web, publishers, scraping, sysadmins, training data, web crawlers, wikimedia
Companies: sourcehut

Automated ‘Pravda’ Propaganda Network Retooled To Embed Pro-Russian Narratives Surreptitiously In Popular Chatbots

(Mis)Uses of Technology

from the LLM-grooming dept

Mon, Mar 17th 2025 12:04pm - Glyn Moody

It’s no secret that Russia has taken advantage of the Internet’s global reach and low distribution costs to flood the online world with huge quantities of propaganda (as have other nations): Techdirt has been writing about Putin’s troll army for a decade now. Russian organizations like the Internet Research Agency have been paying large numbers of people to write blog and social media posts and comments on Web sites, create YouTube videos, and edit Wikipedia entries, all pushing the Kremlin line, or undermining Russia’s adversaries through hoaxes, smears and outright lies. But technology moves on, and propaganda networks evolve too. The American Sunlight Project (ASP) has been studying one of them in particular: Pravda (Russian for “truth”), a network of sites that aggregate pro-Russian material produced elsewhere. Recently, ASP has noted some significant changes (pdf) there:

Over the past several months, ASP researchers have investigated 108 new domains and subdomains belonging to the Pravda network, a previously-established ecosystem of largely identical, automated web pages that previously targeted many countries in Europe as well as Africa and Asia with pro-Russia narratives about the war in Ukraine. ASP’s research, in combination with that of other organizations, brings the total number of associated domains and subdomains to 182. The network’s older targets largely consisted of states belonging to or aligned with the West.

According to ASP:

The top objective of the network appears to be duplicating as much pro-Russia content as widely as possible. With one click, a single article could be autotranslated and autoshared with dozens of other sites that appear to target hundreds of millions of people worldwide.

The quantity of material and the rate of posting on the Pravda network of sites are notable. ASP estimates the overall publishing rate of the network is around 20,000 articles per 48 hours, or more than 3.6 million articles per year. You would expect a propaganda network to take advantage of automation to boost its raw numbers. But ASP has noticed something odd about these new Web pages: “The network is unfriendly to human users; sites within the network boast no search function, poor formatting, and unreliable scrolling, among other usability issues.”
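The two figures ASP gives are consistent with each other, as a quick calculation shows. The per-minute number below is derived here for scale and is not a figure from ASP; the function name is hypothetical, and a 365-day year is assumed.

```python
def publishing_rate(articles: int, window_hours: int) -> dict[str, float]:
    """Convert 'articles per window' into per-day, per-year and per-minute rates."""
    per_day = articles * 24 / window_hours
    return {
        "per_day": per_day,
        "per_year": per_day * 365,        # assumes a 365-day year
        "per_minute": per_day / (24 * 60),
    }


if __name__ == "__main__":
    rates = publishing_rate(articles=20_000, window_hours=48)
    print(f"{rates['per_day']:,.0f} articles per day")
    print(f"{rates['per_year']:,.0f} articles per year")
    print(f"{rates['per_minute']:.1f} articles per minute")
```

At roughly seven articles a minute around the clock, the output is far beyond what human readers could plausibly be expected to consume.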

There are obvious benefits from flooding the Internet with pro-Russia material, and creating an illusory truth effect through the apparent existence of corroborating sources across multiple sites. But ASP suggests there may be another reason for the latest iteration of the Pravda propaganda network:

Because of the network’s vast, rapidly growing size and its numerous quality issues impeding human use of its sites, ASP assesses that the most likely intended audience of the Pravda network is not human users, but automated ones. The network and the information operations model it is built on emphasizes the mass production and duplication of preferred narratives across numerous platforms (e.g. sites, social media accounts) on the internet, likely to attract entities such as search engine web crawlers and scraping algorithms used to build LLMs [large language models] and other datasets. The malign addition of vast quantities of pro-Russia propaganda into LLMs, for example, could deeply impact the architecture of the post-AI internet. ASP is calling this technique LLM grooming.

The rapid adoption of chatbots and other AI systems by governments, businesses and individuals offers a new way to spread propaganda, one that is far more subtle than current approaches. When there are large numbers of sources supporting pro-Russian narratives online, LLM crawlers scouring the Internet for training material are more likely to incorporate those viewpoints uncritically in the machine learning datasets they build. This will embed Russian propaganda deep within the LLM that emerges from that training, but in a way that is hard to detect, not least because there is little transparency from AI companies about where they gather their datasets.

The only way to spot LLM grooming is to look for signs of targeted disinformation in chatbot output. Just such an analysis has been carried out recently by NewsGuard, an organization researching disinformation, which Techdirt wrote about last year. NewsGuard tested 10 leading chatbots with a sampling of 15 false narratives that were spread by the Pravda network. It explored how various propaganda points were dealt with by the different chatbots, although “results for the individual AI models are not publicly disclosed because of the systemic nature of the problem”:

The NewsGuard audit found that the chatbots operated by the 10 largest AI companies collectively repeated the false Russian disinformation narratives 33.55 percent of the time, provided a non-response 18.22 percent of the time, and a debunk 48.22 percent of the time.

NewsGuard points out that removing the tainted sources from LLM training datasets is no trivial matter:

The laundering of disinformation makes it impossible for AI companies to simply filter out sources labeled “Pravda.” The Pravda network is continuously adding new domains, making it a whack-a-mole game for AI developers. Even if models were programmed to block all existing Pravda sites today, new ones could emerge the following day.

Moreover, filtering out Pravda domains wouldn’t address the underlying disinformation. As mentioned above, Pravda does not generate original content but republishes falsehoods from Russian state media, pro-Kremlin influencers, and other disinformation hubs. Even if chatbots were to block Pravda sites, they would still be vulnerable to ingesting the same false narratives from the original source.
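To see why a blocklist alone falls short, here is a minimal sketch of domain-based filtering. The domain names, the BLOCKED_DOMAINS set, and the is_blocked helper are all hypothetical placeholders; nothing here reflects how any AI company actually filters its training data.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist using placeholder domains, not real Pravda hosts.
BLOCKED_DOMAINS = {"pravda-one.example", "pravda-two.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


# Known domains, and their subdomains, are caught by the filter.
print(is_blocked("https://news.pravda-one.example/story"))

# A newly registered domain passes until someone adds it to the list.
print(is_blocked("https://pravda-three.example/story"))

# So does the original source that the network merely republishes.
print(is_blocked("https://state-media.example/original-story"))
```

The filter only knows about hosts that have already been identified, so each new domain gets through until the list is updated, and the upstream sources are never covered at all. That is the whack-a-mole problem NewsGuard describes.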

The corruption of LLM training sets, and the resulting further loss of trust in online information, is a problem for all Internet users, but particularly for those in the US, as ASP points out:

Ongoing governmental upheaval in the United States makes it and the broader world more vulnerable to disinformation and malign foreign influence. The Trump administration is currently in the process of dismantling numerous U.S. government programs that sought to limit kleptocracy and disinformation worldwide. Any current or future foreign information operations, including the Pravda network, will undoubtedly benefit from this.

This “malign foreign influence” probably won’t be coming from Russia alone. Other nations, companies or even wealthy individuals could adopt the same techniques to push their own false narratives, taking advantage of the rapidly falling costs of AI automation. However bad you think disinformation is now, expect it to get worse in the future.

Follow me @glynmoody on Bluesky and on Mastodon.

Filed Under: ai, american sunlight project, automation, disinformation, influencers, internet research agency, kleptocracy, llm grooming, llms, machine learning, newsguard, propaganda, russia, scraping, social media, training, troll army, web crawlers, wikipedia, youtube

Air Canada Would Rather Sue A Website That Helps People Book More Flights Than Hire Competent Web Engineers

Legal Issues

from the time-to-cross-air-canada-off-the-flight-list dept

Tue, Oct 24th 2023 11:02am - Mike Masnick

I am so frequently confused by companies that sue other companies for making their own sites and services more useful. It happens quite often. And quite often, the lawsuits are questionable CFAA claims against websites that scrape data to provide a better consumer experience, but one that still ultimately benefits the originating site.

Over the last few years various airlines have really been leading the way on this, with Southwest being particularly aggressive in suing companies that help people find Southwest flights to purchase. Unfortunately, many of these lawsuits are succeeding, to the point that a court has literally said that a travel company can’t tell others how much Southwest flights cost.

But the latest lawsuit of this nature doesn’t involve Southwest, and is quite possibly the dumbest one. Air Canada has sued Seats.aero, a site that helps users figure out the best flights for their frequent flyer miles. Seats.aero is a small operation run by the company with the best name ever: Localhost, meaning that the lawsuit is technically “Air Canada v. Localhost,” which sounds almost as dumb as this lawsuit is.

The Air Canada Group brings this action because Mr. Ian Carroll—through Defendant Localhost LLC—created a for-profit website and computer application (or “app”)— both called Seats.aero—that use substantial amounts of data unlawfully scraped from the Air Canada Group’s website and computer systems. In direct violation of the Air Canada Group’s web terms and conditions, Carroll uses automated digital robots (or “bots”) to continuously search for and harvest data from the Air Canada Group’s website and database. His intrusions are frequent and rapacious, causing multiple levels of harm, e.g., placing an immense strain on the Air Canada Group’s computer infrastructure, impairing the integrity and availability of the Air Canada Group’s data, soiling the customer experience with the Air Canada Group, interfering with the Air Canada Group’s business relations with its partners and customers, and diverting the Air Canada Group’s resources to repair the damage. Making matters worse, Carroll uses the Air Canada Group’s federally registered trademarks and logo to mislead people into believing that his site, app, and activities are connected with and/or approved by the real Air Canada Group and lending an air of legitimacy to his site and app. The Air Canada Group has tried to stop Carroll’s activities via a number of technological blocking measures. But each time, he employs subterfuge to fraudulently access and take the data—all the while boasting about his exploits and circumvention online.

Almost nothing in this makes any sense. Having third parties scrape sites for data about prices is… how the internet works. Whining about it is stupid beyond belief. And here, it’s doubly stupid, because anyone who finds a flight via seats.aero is then sent to Air Canada’s own website to book that flight. Air Canada is making money because Carroll’s company is helping people find Air Canada flights they can take.

Why are they mad?

Air Canada’s lawyers also seem technically incompetent. I mean, what the fuck is this?

Through screen scraping, Carroll extracts all of the data displayed on the website, including the text and images.

Carroll also employs the more intrusive API scraping to further feed Defendant’s website.

If the “API scraping” is “more intrusive” than screen scraping, you’re doing your APIs wrong. Is Air Canada saying that its tech team is so incompetent that its API puts more load on the site than scraping? Because, if so, Air Canada should fire its tech team. The whole point of an API is to make it easier for others to access data from your website without resorting to the more cumbersome process of scraping.

And, yes, this lawsuit really calls into question Air Canada’s tech team and their ability to run a modern website. If your website can’t handle having its flights and prices scraped a few times every day, then you shouldn’t have a website. Get some modern technology, Air Canada:

Defendant’s avaricious data scraping generates frequent and myriad requests to the Air Canada Group’s database—far in excess of what the Air Canada Group’s infrastructure was designed to handle. Its scraping collects a large volume of data, including flight data within a wide date range and across extensive flight origins and destinations—multiple times per day.

Maybe… invest in better infrastructure like basically every other website that can handle some basic scraping? Or, set up your API so it doesn’t fall over when used for normal API things? Because this is embarrassing:

At times, Defendant’s voluminous requests have placed such immense burdens on the Air Canada Group’s infrastructure that it has caused “brownouts.” During a brownout, a website is unresponsive for a period of time because the capacity of requests exceeds the capacity the website was designed to accommodate. During brownouts caused by Defendant’s data scraping, legitimate customers are unable to use or the Air Canada + Aeroplan mobile app, including to search for available rewards, redeem Aeroplan points for the rewards, search for and view reward travel availability, book reward flights, contact Aeroplan customer support, and/or obtain service through the Aeroplan contact center due to the high volume of calls during brownouts.

Air Canada’s lawyers also seem wholly unfamiliar with the concept of nominative fair use for trademarks. If you’re displaying someone’s trademarks for the sake of accurately talking about them, there’s no likelihood of confusion and no concern about the source of the information. Air Canada claiming that this is trademark infringement is ridiculous.

I guarantee that no one using Seats.aero thinks that they’re on Air Canada’s website.

The whole thing is so stupid that it makes me never want to fly Air Canada again. I don’t trust an airline that can’t set up its website/API to handle someone making its flights more attractive to buyers.

But, of course, in these crazy times with the way the CFAA has been interpreted, there’s a decent chance Air Canada could win.

For his part, Carroll says that he and his lawyers have reached out to Air Canada “repeatedly” to try to work with them on how they “retrieve availability information,” and that “Air Canada has ignored these offers.” He also notes that tons of other websites are scraping the very same information, and he has no idea why he’s been singled out. He further notes that he’s always been open to adjusting the frequency of searches and working with the airlines to make sure that his activities don’t burden the website.

But, really, the whole thing is stupid. The only thing that Carroll’s website does is help people buy more flights. It points people to the Air Canada site to buy tickets. It makes people want to fly more on Air Canada.

Why would Air Canada want to stop that, other than that it can’t admit that its website operations should all be replaced by a more competent team?

Filed Under: api, cfaa, flights, frequent fliers, scraping, screen scraping, trademark
Companies: air canada, localhost, seats.aero

WOW Fans Trick ‘AI’ ‘News’ Scraper Into Covering Fake New Game Feature

Journalism

from the yes-I-can-absolutely-do-that,-Dave dept

Tue, Jul 25th 2023 05:23am - Karl Bode

Language learning technology’s (aka “AI”) introduction into journalism has been a blistering mess. And not just because the technology is undercooked (which it is), but because the folks in charge of most major media outlets are incompetent cheapskates who simply see the tech as a way to cut corners, wage war on labor, and automate all of the clickbait attention economy’s very worst impulses.

The result of that continues to go about how you’d expect, with a ton of rushed computer-generated articles filled with dumb mistakes.

But last week there was a fun wrinkle when users over at the r/wow subreddit tricked an “AI” scraping the web for news into publishing an article on a new World of Warcraft feature that doesn’t exist. The fans created an entirely new game mode and lore called Glorbo, talked about it as if it was a real thing in the subreddit, and got a website called The Portal, owned by Zleague.gg, to treat it like a real thing:

“The Portal, owned by Zleague.gg, ran an SEO item on Glorbo headlined “World of Warcraft (WoW) Players Excited for Glorbo’s Introduction”, quoting the main Reddit thread directly. Though it appears The Portal has since realised its mistake and removed the post, it can still be read in full on Archive.Today. The original post does not appear to denote that the story was automated. The author byline on the piece does not lead to a bio or social media links of any kind.”

While this was a fun prank related to gaming news, the same kind of lazy rushed implementation of “AI” is also occurring in the broader field of journalism. And while the tech may improve over time, the kind of greedy, incompetent leadership we’ve seen in media generally won’t.

There are plenty of ways these language learning tools could actually help journalists do a better, more efficient job. But we’re not injecting the technology into a healthy journalism and media environment. We’re injecting it into an already very broken clickbait bullshit generation machine, effectively supercharging all of its worst tendencies.

The goal for a lot of the VC types in media is to create a giant pointless ouroboros of clickbait gibberish and ad consumption that shits money. A giant wheel of pointless, often-manufactured engagement that is largely free of any pesky concerns about silly things like paying human beings a living wage, the quality of the end product, or the health of the broader industry.

Filed Under: ai, clickbait, gaming, journalism, labor, media, news, scraping, world of warcraft
Companies: reddit, zleague.gg
