Blocking The Internet Archive Won’t Stop AI, But It Will Erase The Web’s Historical Record
from the willingly-burning-libraries dept
Imagine a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper.
That’s effectively what’s begun happening online in the last few months. The Internet Archive—the world’s largest digital library—has been preserving news sites since it went online in the mid-1990s. The Archive’s mission is to preserve the web and make it accessible to the public. To that end, the organization operates the Wayback Machine, which now contains more than one trillion archived web pages and is used daily by journalists, researchers, and courts.
But in recent months, The New York Times has begun blocking the Archive from crawling its website, using technical measures that go beyond the web’s traditional robots.txt rules. That risks cutting off a record that historians and journalists have relied on for decades. Other newspapers, including The Guardian, seem to be following suit.
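For readers unfamiliar with the convention being referenced: robots.txt is voluntary. A polite crawler fetches the file, reads which paths are off-limits for its user-agent, and declines to crawl them; that is why blocks enforced on the server side, without relying on the bot’s cooperation, go “beyond” robots.txt. A minimal sketch of that check, using Python’s standard library (the domain and user-agent string below are illustrative placeholders, not the Times’ or the Archive’s actual configuration):

    # Sketch: how a well-behaved crawler consults robots.txt before fetching a page.
    # "example.com" and "example-archive-bot" are placeholders for illustration only.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    url = "https://example.com/2024/05/some-article.html"
    if rp.can_fetch("example-archive-bot", url):
        print("robots.txt permits this crawler to archive the page")
    else:
        print("robots.txt asks this crawler to stay away")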
For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.
The Times says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several—including the Times—are now suing AI companies over whether training models on copyrighted material violates the law. There’s a strong case that such training is fair use.
Whatever the outcome of those lawsuits, blocking nonprofit archivists is the wrong response. Organizations like the Internet Archive are not building commercial AI systems. They are preserving a record of our history. Turning off that preservation in an effort to control AI access could essentially torch decades of historical documentation over a fight that libraries like the Archive didn’t start, and didn’t ask for.
If publishers shut the Archive out, they aren’t just limiting bots. They’re erasing the historical record.
Archiving and Search Are Legal
Making material searchable is a well-established fair use. Courts have long recognized it’s often impossible to build a searchable index without making copies of the underlying material. That’s why when Google copied entire books in order to make a searchable database, courts rightly recognized it as a clear fair use. The copying served a transformative purpose: enabling discovery, research, and new insights about creative works.
The Internet Archive operates on the same principle. Just as physical libraries preserve newspapers for future readers, the Archive preserves the web’s historical record. Researchers and journalists rely on it every day. According to Archive staff, Wikipedia alone links to more than 2.6 million news articles preserved at the Archive, spanning 249 languages. And that’s only one example. Countless bloggers, researchers, and reporters depend on the Archive as a stable, authoritative record of what was published online.
The same legal principles that protect search engines must also protect archives and libraries. Even if courts place limits on AI training, the law protecting search and web archiving is already well established.
The Internet Archive has preserved the web’s historical record for nearly thirty years. If major publishers begin blocking that mission, future researchers may find that huge portions of that historical record have simply vanished. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.
Republished from the EFF’s Deeplinks blog.
Filed Under: ai, archives, copyright, culture, fair use, history
Companies: internet archive, ny times, the guardian


Comments on “Blocking The Internet Archive Won’t Stop AI, But It Will Erase The Web’s Historical Record”
No, but they are being used by people who are.
I feel like the complaints should be directed at the AI companies misusing you. The news companies didn’t start or ask for this, either.
Hopefully, it can erase Trump and MAGA.
Man, I miss when I could take a call from a random number I didn’t recognize and it wouldn’t be spam. I miss when an unsolicited knock on the door was just girl scouts with cookies and not another Jehovah’s Witness or Mormon. I miss when I didn’t have to run an ad blocker because websites were readable without ads. I miss when I didn’t need to have emails just to log in to services because I didn’t want all the spam.
It’s almost like we have historically been terrible at dealing with bad actors abusing and spamming systems, to the point that good actors have been hurt.
In your rush to be an AI bootlicker you have ignored some pretty basic info. The response to AI spam is not new.
How about this. Why don’t you personally pay websites to accept AI spam? I’m sure as much as you love it, you would love to cover the costs it incurs? Or are you just pouting with no real answer because people are being mean to your pal AI?
i tend to think of AI as the excuse that permission culture is using at the moment to try and destroy not just IA, but all libraries. i think also that newspapers just don’t want to be part of the historical record anymore, and frankly, i don’t think they care to distribute actual news either, so not a big loss. However, the already historical newspaper morgues which have not already been destroyed are often locked up and under further threat, wherever and however they are housed.
Of course, blocking AI from tokenizing more bullshit and dreck is probably not the worst idea, if we are stuck with it, on multiple counts.
Ban generative AI
This problem, like so many others, will stop once more major countries ban LLMs and generative AI.
IA should ignore robots.txt.
This is probably the real reason they want to block them.
IA could take a page from Archive Team’s book and give users the ability to opt in to a scraping botnet so they can use residential IPs to circumvent these blocks.
SO few comments. Looks like you can’t handle how much you get criticized for your dumb as rocks AI takes.
Seems like techdirt is trash. Might as well be fox.
I'm totally on the Internet Archive's side.
I don’t know what’s wrong with your comments section honestly. For the record, Best Tech news site of them all.
Oh and for the record if you banned generative AI
I’d break the law and keep using it anyway.
Let's save 2022 Archive, and stop.
The Internet Archive ~2022 should be everyone’s trusted source given where AI is rapidly taking online content; eventually even the archives will degrade. Let’s just use static info from before 2022 going forward, it would handle 99.99% of most questions. And do we really want to save what’s happened after 2022? Maybe a break is ok.
So someone (not me, it’s big) should package up a 2022 archive copy (100 petabytes big), add an LLM for access and magic. Scroll like it’s 2022, none of that annoying and increasingly made-up post-2022 information to worry about.
Package up is more than just a copy, needs to be distributed, secured, trusted. Maybe some use for blockchain to ensure no actors can ever affect the archive and it remains a reliable and incredibly useful repository of world knowledge up to 2022. Internet 3.0, first and last release.