News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
from the our-digital-history dept
Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.
This is a mistake we’re going to regret for generations.
Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
The Times has gone even further:
The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
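To make the mechanism at issue concrete: a robots.txt disallow is purely advisory. It asks crawlers to stay away, and only well-behaved crawlers honor it. The snippet below is an illustrative sketch, not the Times’s actual file (example.com stands in for any publisher), showing how a compliant bot such as archive.org_bot would check the rules using Python’s standard-library robots.txt parser:

```python
import urllib.robotparser

# Hypothetical robots.txt resembling what the article describes:
# the Internet Archive's crawler disallowed, everyone else allowed.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant archival crawler checks before fetching:
print(rp.can_fetch("archive.org_bot", "https://example.com/news/story"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/news/story"))     # True
```

The asymmetry is the whole point: a directive like this stops exactly the crawlers polite enough to ask, which is why commenters below note it does little against the scrapers publishers are actually worried about.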
I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.
But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.
And that’s bad.
The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.
In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.
And now we’re telling them they can’t preserve the work of our most trusted publications.
Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.
Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.
As one computer scientist quoted in the Nieman piece put it:
“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.
The most frustrating bit of all of this: The Guardian admits they haven’t actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They’re breaking historical preservation based on a hypothetical threat:
The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.
And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).
Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.
This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.
The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:
In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.
Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.
Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
A Gannett spokesperson told Nieman Lab that it was about “safeguarding our intellectual property,” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials protected by copyright law. The claim that they must be blocked to safeguard such content is both technologically and historically illiterate.
And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.
The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:
“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.
What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.
Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?
This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.
I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.
We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.
Filed Under: ai, archives, culture, libraries, scanning, scraping
Companies: internet archive, ny times, the guardian, usa today


Comments on “News Publishers Are Now Blocking The Internet Archive, And We May All Regret It”
Well. As a race we have consistently held that profit and riches are worth more than life itself.
It won't stop the AI crawlers anyway
I’ve invested a huge amount of time over the past several years investigating and blocking AI crawlers from the sites that I’m responsible for. They’ve countered every move — by shifting to clouds, to commercial proxies, to residential proxies, to sketchy hosts, to everything. As a result, I’ve had to block quite a few things, but I haven’t blocked the Internet Archive because they’re not the problem.
What could be done in this case is for these publishers to give the IA a private feed under embargo: let them grab everything and stash it so that it’ll remain available to future researchers. I’ve made some arrangements like this with certain academic scholars: I want them to be able to do their work unimpeded by all the defenses I’ve been forced to put in place.
Re:
👍
Re:
If you won’t let the Archive share the material, because then those crawlers could get it, it really doesn’t sound like you’ve been “forced”. Just that you’ve chosen to block a technology you’re opposed to, even if it has no effect on your servers.
The mistakes we’re gonna regret for generations are the regurgitation engines and their thieving masters.
Re:
…says A. Dumbass.
Coward is a good description (so glad TD has that).
I think someone didn’t read the piece.
You doomers are the bane of my existence. Get real.
Why do you consider OpenAI to be “bad guys”?
Re:
You’re misquoting. He’s saying “That’s exactly right” to this bit:
As indicated also by this sentence:
Re: Re:
I don’t accept that as “misquoting”; it seems reasonable to interpret the word “that” as referring to the entire block-quoted segment directly preceding it, especially given that it says “exactly” rather than something like “partially”.
While you might be right about the intended meaning, we can’t really determine that from the text, and I don’t agree with the “indicated by”. That indicates Mike is agreeing, again, with the “collateral damage” part, and says nothing about agreeing or disagreeing with the “bad guy” part. But it could just be sloppy writing.
Re: Re: Re:
Bullshit. Exactly doesn’t mean “entirely” as opposed to “partially.” And you’re ignoring that “bad guys” is in single quotes in the quote. It’s not an assertion that OpenAI is the bad guys (nor is it that they’re not). It’s an acknowledgement by the speaker that some people perceive it to be that way.
And the sentence following “that’s exactly right” clearly indicates what Mike intended. He would have said something about OpenAI being a bad guy if that were what “that’s exactly right” was referring to.
It seems pretty clear cut to me. If Mike wanted to express his opinion of OpenAI, he would. You’re being obtuse.
Re: Re: Re:2
Of course it does. The Cambridge dictionary says “completely correct,” Britannica says nearly the same, and Collins says “no different from what you are stating.” One doesn’t normally call someone exactly right when agreeing with only a single point of several expressed.
That was not clear; the word “and” suggests it to be separate from the “considered to be” point, and it’s not clear whether the quotation marks were meant as scare quotes, actual quotes of some unidentified third party, or just a marker of colloquial usage. If the text had said “that are used”, it would have obviously been a continuation of “considered to be”.
With two people agreeing, and Mike’s succeeding sentence, I suppose you two are right about the point; I also expect Mike would have said explicitly if now considering OpenAI some kind of villain. But it’s not “clear” what the pronoun “that” referred to. It seems most likely both the quoted person and Mike were using language imprecisely.
On the other hand, another anonymous comment already replied to explain why they believe Mike considers OpenAI a villain, so I may not be the only person who interpreted it that way—although the anonymous commenter doesn’t give any justification for why they believe those to be Mike’s reasons, so it could be a non-reply masquerading as a reply.
Re: Re: Re:3
Let me stop you there. Citing a dictionary only tells you how its writers have observed a word being used. It is not prescriptive of how you must use a word, nor is it a perfect guide to how someone used a word in a particular instance.
Note that you said “someone” was right, when Mike had said “that,” clearly referring not to the person but to the point he perceived was being made. And you can definitely say “that” is right when referring only to a portion of a longer sentence or thought.
No…? Some writers try to be concise so they don’t always reuse a phrase in the same sentence. “And” here suggests “considered to be” is distributed in front of each group. That’s how I would have written it.
It doesn’t really matter which, and those aren’t mutually exclusive scenarios. It’s clear the quotes mean the speaker is distancing themselves from the judgment, disavowing it as their own words.
Welcome to human communication.
I didn’t read that as them speaking for Mike. They didn’t say, “Mike believes…” I’d attribute that position entirely to the anonymous commenter. Mike might agree, but I wouldn’t assume so until he says so or you dig into past articles where he has spoken of it. Also, I’d expect Mike to be more nuanced on the topic than is being represented. I’ve been reading his writing for 15+ years.
You seem unwilling to accept Mike’s own words while accepting an anonymous person’s even more imprecise words. Two people out of a sample of hundreds or thousands of possible readers isn’t a consensus.
Re:
Greed.
OpenAI doesn’t want to archive the web for future generations or to give away free knowledge. Quite the opposite. It needs a lot of money to exist, and pretty much all of it comes from investors who will soon want their money back.
Re:
Mike clearly doesn’t think AI companies are bad guys, even if he might mildly disagree with some of their behaviors. This is plain as day in his relevant posts.
I, though, do. So my question is: why don’t you?
I remember one of the pieces of advice I was given was that I had to put up a paywall on my news site. I was told that if people wanted your content, then they would pay for it. Otherwise, my website would always be unprofitable.
I rejected that because I knew that news is always going to be a public good that needs to be accessible to all, not just for people who have a lot of cash to burn.
Heck, at one point, someone told me that I’m part of the problem of people expecting news to be free because I allow free access. I responded by pointing out that this is a conscious decision because I believed that news is a public good that all should benefit from.
Now, I’m sure those same people are telling others that AI is going to scrape your content as well if you don’t put up a paywall.
In short, business types are putting a LOT of pressure on news websites to paywall EVERYTHING. It seems that they are, sadly, starting to succeed with others.
Yet the same organizations will quickly fire as many producers as possible and replace them with the AI they seek to block. Make it make sense.
I started noticing this very concerning trend a while ago, and it’s discouraging that it seems to be spreading. Archival for many sites now rests entirely on the shoulders of archive.today, a “rogue” scraper that valiantly ignores exclusion requests and works around paywalls and loginwalls. Unfortunately, they are not necessarily a good actor. Nobody knows for sure who runs it, and they recently DDOS’d a blogger who covered them, probably as an overreaction to legal threats.
And of course, just like IA, archive.today could disappear at any moment. The only way forward for internet archival is to deprecate the centralized repository approach and move to a massively distributed one. Anna’s Archive lights the path here: even if they get taken down by the feds, the archive they’ve built will still be available on countless datahoarder hard drives and will probably be available through a pretty user interface again within a few weeks. We desperately need IA and Archive.today to make an effort to mirror their data in a similar way.
Re:
“overreaction to legal threats”
I wouldn’t be surprised if it was an overreaction.
But man.
Re:
Also, I say man, because it’s sad that either of the two could disappear at any moment.
Re:
Sort of.
The legal threats here were the bogus ones that the archive.today owner launched at the blogger, which they then unilaterally escalated into an illegal DDoS when the blogger didn’t immediately do what they wanted.
Re:
It seems the dispute started when the blogger in question attempted to doxx archive.today’s operator, which is bad, and then said operator resorted to rather questionable methods in retaliation, which made things worse.
Meanwhile, Internet Archive seems to be responding to this situation by rolling out some extremely aggressive new anti-bot measures, presumably to block AI scrapers. Unfortunately they’ve gone way over the line themselves: in the past 48 hours I’ve been blocked by them for escalating intervals for viewing just a handful of pages. Today I was able to view two (count them, two) articles before being blocked for 15 minutes, after which it let me view one (1) more and then blocked me indefinitely. No captcha or other way to get back in quickly, either — when this new block is triggered it just ignores all traffic from your IP address for some interval of time, which can be up to (at least) two full hours.
Needless to say their new restrictions make any serious research using them impossible. Human (not bot) users are now only able to make a handful of queries a day. That is not sustainable and they will have to better tune their new bot-blocker if they don’t want this to turn into a massive self-own …
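For what it’s worth, the escalating behavior described above can be sketched as a per-IP penalty-box scheme. Everything below is a guess at the general pattern, not the Internet Archive’s actual implementation; the class name, thresholds, and intervals are all hypothetical:

```python
import time

# Hypothetical sketch of escalating per-IP blocking. The thresholds and
# intervals are illustrative guesses, not the Internet Archive's actual
# anti-bot configuration.
class EscalatingBlocker:
    def __init__(self, allowed_requests=2, base_block=15 * 60,
                 factor=8, max_block=2 * 60 * 60):
        self.allowed = allowed_requests   # requests permitted before a block
        self.base_block = base_block      # first block length, in seconds
        self.factor = factor              # multiplier per repeat offense
        self.max_block = max_block        # ceiling on block length
        self.state = {}                   # ip -> (count, blocked_until, strikes)

    def check(self, ip, now=None):
        """Return True if the request is allowed, False if the IP is blocked."""
        now = time.time() if now is None else now
        count, blocked_until, strikes = self.state.get(ip, (0, 0.0, 0))
        if now < blocked_until:
            return False                  # still inside the penalty box
        count += 1
        if count > self.allowed:
            # Each strike multiplies the block interval, up to the ceiling.
            block = min(self.base_block * self.factor ** strikes, self.max_block)
            self.state[ip] = (0, now + block, strikes + 1)
            return False
        self.state[ip] = (count, blocked_until, strikes)
        return True
```

With these illustrative settings, a client gets a couple of requests, then a 15-minute block that lengthens on each repeat offense up to a two-hour ceiling, which is roughly the pattern described above. Note that a scheme like this keys on IP address alone, which is exactly why it punishes human readers as readily as bots.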
Two problems
One, I feel the blocking would’ve happened with or without AI.
“We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web.”
Earlier
“I don’t have a clean answer here.”
No offense, but the problem? Even I have a feeling those new companies don’t know.
And another thing: how do you guarantee that approach doesn’t get abused, considering AI crawlers won’t hesitate to abuse it or find a way around it?
(Not against AI, just wanting to know.)
The problem with news sites blocking archiving is the same problem with paywalls. If the news is actually important, it needs to be widespread and remembered. So you’re saying that what you report isn’t important or at least it’s better to leave society ignorant if they can’t pay you for it. If it is important, hiding it is as good as never having written it. People talk about news agencies being the Fourth Estate and a guardian against corruption, but this makes them worse than useless. It makes them collaborators in the dumbing down of society, which only helps authoritarians. And I’m not saying news agencies shouldn’t get paid for their work. I’m just saying the paywall method is contrary to the nature of reporting the news that is actually relevant to the public. Paywall the sports section.
Re:
I completely agree. It’s why I personally refuse to put up a paywall on my site in the first place. I see a lot of news organizations paywalling their content, and the well-off smugly talking about how happily they read this stuff while everyone else does without, thanks to the growing economic mess we’re in. I see the damage done to society, and I said, “Nope, not contributing to that rotten trend.”
I also personally think Techdirt is on the same page on this one as well.
This is underplaying the threat a bit. We know that AI companies are using the Wayback Machine: there is evidence that the Wayback Machine, generally speaking, has been used to train LLMs in the past (AI companies were also caught doing it in the Reddit example). Yes, we haven’t caught anyone scraping The Guardian specifically, but it’s not really a hypothetical, either. Not that it matters; not closing an obvious vulnerability until it’s actually exploited would be negligent.
Companies can whitelist things like assistive software, if they know it won’t be sending the data anywhere. The problem with things like your AI tool is that companies want to use them to scrape. The reason AI companies are so horny about having a browser is they can use your legitimate useragent to scrape, and it can’t be stopped, because you are legitimate traffic. It’s an unblockable Trojan horse.
I don’t know that it can, as long as you have significant actors acting in bad faith. The open web is a commons that requires people to be (mostly) good stewards and not piss in the pool. Avoiding a tragedy of the commons requires either restraining them or finding a way for them to contribute back for their usage.
Re:
What you call a problem merely seems like a fact to me. I’m still not really clear why that should be considered a significant threat. If a news company can’t compete with a computer auto-generating stuff, how can they possibly compete with other news companies who employ talented humans?
Much of the best-remembered reporting, such as Woodward and Bernstein’s Watergate coverage, was based on non-public sources, which these crawlers couldn’t ever find anyway (except if people are giving them highly confidential data, as the current U.S. federal government might just be dumb enough to do). They weren’t regurgitating already-published data; they were doing hard work.
And extracting text from images was one of the first problems these companies solved. So… with investors throwing obscene amounts of money at them, why don’t they just rent a post office box and pay the $20/month or whatever it costs to have physical copies delivered? Hell, they likely have enough cash to buy some recycling companies and scan all the paper that comes in.
This all seems like a classic moral panic to me. People feel like they have to “do something”, nevermind whether it works or whether the threat is even realistic.
Re: Re:
Moral panics and stupid reactions can be based on real threats.
Why can’t a news site compete with AI slop? Because humans are fucking stupid.
People vs AI
News organizations:
“We don’t want their AI looking at the content we pay people to create. We want people to pay us to look at the slop our AI creates.”
They’re blocking the Internet Archive cuz when they publish some real whopper bullshit, and get caught, they don’t want the evidence of that being kept.
MSM lies, a lot. It has nothing to do with AI. What’s the AI going to do, copy their “style”?
Words are abundant. Trust isn’t.
There’s a hard truth that legacy news orgs still haven’t fully accepted: in the digital age, the literal words on a page aren’t the scarce asset anymore.
Distribution is infinite. Copy replicable. Archives can be duplicated in seconds. The value isn’t the static article, it’s the trust, authority, relationship, and ecosystem around it.
For the most part, legacy news orgs have done a pretty phenomenal job of burning through that currency themselves.
Sensationalism. Blurred lines between reporting and commentary. Chasing clicks over clarity. When trust erodes, the words just don’t carry the same weight.
Historically, news orgs benefited from scarcity. Limited printing presses. Limited broadcast channels. Limited shelf space. In that world, control of content equaled power.
What’s scarce today is attention, trust, and clarity. Not information. Discernment. Not access. Credibility.
You can block crawlers. Restrict archives. Try to lock down text. But none of that brings back what was lost.
Words are abundant. Trust isn’t.
“We believe in the value of The New York Times’s human-led journalism…” lol we’ll see how long that lasts, if it isn’t untrue already.
Kudos for highlighting this new threat to our historical knowledge. I don’t think it can be stressed enough how important The Internet Archive is as a keeper of the historical record.
I would offer a solution that respects the publishers’ paywall, preserves the historical record, and maybe even generates a small but desirable revenue stream for the Archive: Establish an agreement whereby the Archive is permitted to archive their content BUT LIMIT ACCESS according to the publishers’ wishes. Anyone wanting access should be required to pay a fee, with a base amount going to the publisher, and perhaps an additional fee going to support the Archive.
What do you think? I can’t imagine it’s too technically difficult to implement, though I understand the management burden could be substantial (thus the fee for the Archive to pay for it).