News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.
This is a mistake we’re going to regret for generations.
Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
The Times has gone even further:
The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
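For readers unfamiliar with the mechanism being described: robots.txt is a voluntary convention, not an access control. Well-behaved crawlers like the Internet Archive's read the file and honor it. A minimal sketch of how a disallow rule of the kind the Times added works, using Python's standard-library parser (the domain and article URL here are hypothetical, chosen just for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt of the kind described above: the Internet
# Archive's crawler is disallowed site-wide, everyone else is allowed.
robots_txt = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

article = "https://example.com/2026/01/some-story.html"

# The archive crawler is told to stay out; an ordinary crawler is not.
archive_allowed = parser.can_fetch("archive.org_bot", article)  # False
google_allowed = parser.can_fetch("Googlebot", article)         # True
```

The key point is that nothing technically enforces any of this. Compliance is entirely up to the crawler, which is exactly why the Internet Archive, which respects these rules, ends up blocked while less scrupulous scrapers simply ignore the file.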
I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.
But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.
And that’s bad.
The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.
In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.
And now we’re telling them they can’t preserve the work of our most trusted publications.
Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.
Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.
As one computer scientist quoted in the Nieman piece put it:
“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.
The most frustrating bit of all of this: The Guardian admits it hasn’t actually documented AI companies scraping its content through the Wayback Machine. This is purely precautionary and theoretical. The publisher is undermining historical preservation based on a hypothetical threat:
The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.
And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).
Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.
This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.
The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:
In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.
Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.
Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
A Gannett spokesperson told Nieman Lab that it was about “safeguarding our intellectual property,” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials that were protected by copyright law. The claim that they have to be blocked to safeguard that content is both technologically and historically illiterate.
And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.
The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:
“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.
What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.
Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?
This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.
I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.
We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.