News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
from the our-digital-history dept
Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.
This is a mistake we’re going to regret for generations.
Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
The Times has gone even further:
The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
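To make the mechanism at issue concrete: a robots.txt disallow is purely advisory. It asks crawlers to stay away, and only well-behaved crawlers honor it. The snippet below is an illustrative sketch, not the Times’s actual file (example.com stands in for any publisher), showing how a compliant bot such as archive.org_bot would check the rules using Python’s standard-library robots.txt parser:

```python
import urllib.robotparser

# Hypothetical robots.txt resembling what the article describes:
# the Internet Archive's crawler disallowed, everyone else allowed.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant archival crawler checks before fetching:
print(rp.can_fetch("archive.org_bot", "https://example.com/news/story"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/news/story"))     # True
```

The asymmetry is the whole point: a directive like this stops exactly the crawlers polite enough to ask, which is why commenters below note it does little against the scrapers publishers are actually worried about.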
I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.
But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.
And that’s bad.
The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.
In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.
And now we’re telling them they can’t preserve the work of our most trusted publications.
Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.
Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.
As one computer scientist quoted in the Nieman piece put it:
“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.
The most frustrating bit of all of this: The Guardian admits they haven’t actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They’re breaking historical preservation based on a hypothetical threat:
The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.
And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).
Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.
This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.
The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:
In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.
Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.
Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
A Gannett spokesperson told Nieman Lab that it was about “safeguarding our intellectual property,” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials protected by copyright law. The claim that they must be blocked to safeguard such content is both technologically and historically illiterate.
And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.
The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:
“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.
What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.
Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?
This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.
I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.
We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.
Filed Under: ai, archives, culture, libraries, scanning, scraping
Companies: internet archive, ny times, the guardian, usa today


Comments on “News Publishers Are Now Blocking The Internet Archive, And We May All Regret It”
Well. As a race we have consistently held that profit and riches are worth more than life itself.
It won't stop the AI crawlers anyway
I’ve invested a huge amount of time over the past several years investigating and blocking AI crawlers from the sites that I’m responsible for. They’ve countered every move — by shifting to clouds, to commercial proxies, to residential proxies, to sketchy hosts, to everything. As a result, I’ve had to block quite a few things, but I haven’t blocked the Internet Archive because they’re not the problem.
What could be done in this case is for these publishers to give the IA a private feed under embargo: let them grab everything and stash it so that it’ll remain available to future researchers. I’ve made some arrangements like this with certain academic scholars: I want them to be able to do their work unimpeded by all the defenses I’ve been forced to put in place.
Re:
👍
Re:
If you won’t let the Archive share the material, because then those crawlers could get it, it really doesn’t sound like you’ve been “forced”. Just that you’ve chosen to block a technology you’re opposed to, even if it has no effect on your servers.
The mistakes we’re gonna regret for generations are the regurgitation engines and their thieving masters.
Re:
…says A. Dumbass.
Coward is a good description (so glad TD has that).
I think someone didn’t read the piece.
You doomers are the bane of my existence. Get real.
Why do you consider OpenAI to be “bad guys”?
Re:
You’re misquoting. He’s saying “That’s exactly right” to this bit:
As indicated also by this sentence:
Re: Re:
I don’t accept that as “misquoting”; it seems reasonable to interpret the word “that” as referring to the entire block-quoted segment directly preceding it, especially given that it says “exactly” rather than something like “partially”.
While you might be right about the intended meaning, we can’t really determine that from the text, and I don’t agree with the “indicated by”. That indicates Mike is agreeing, again, with the “collateral damage” part, and says nothing about agreeing or disagreeing with the “bad guy” part. But it could just be sloppy writing.
Re: Re: Re:
Bullshit. Exactly doesn’t mean “entirely” as opposed to “partially.” And you’re ignoring that “bad guys” is in single quotes in the quote. It’s not an assertion that OpenAI is the bad guys (nor is it that they’re not). It’s an acknowledgement by the speaker that some people perceive it to be that way.
And the sentence following “that’s exactly right” clearly indicates what Mike intended. He would have said something about OpenAI being a bad guy if that were what “that’s exactly right” was referring to.
It seems pretty clear cut to me. If Mike wanted to express his opinion of OpenAI, he would. You’re being obtuse.
Re: Re: Re:2
Of course it does. The Cambridge dictionary says “completely correct,” Britannica says nearly the same, and Collins says “no different from what you are stating.” One doesn’t normally call someone exactly right when agreeing with only a single point of several expressed.
That was not clear; the word “and” suggests it to be separate from the “considered to be” point, and it’s not clear whether the quotation marks were meant as scare quotes, actual quotes of some unidentified third party, or just a marker of colloquial usage. If the text had said “that are used”, it would have obviously been a continuation of “considered to be”.
With two people agreeing, and Mike’s succeeding sentence, I suppose you two are right about the point; I also expect Mike would have said explicitly if now considering OpenAI some kind of villain. But it’s not “clear” what the pronoun “that” referred to. It seems most likely both the quoted person and Mike were using language imprecisely.
On the other hand, another anonymous comment already replied to explain why they believe Mike considers OpenAI a villain, so I may not be the only person who interpreted it that way—although the anonymous commenter doesn’t give any justification for why they believe those to be Mike’s reasons, so it could be a non-reply masquerading as a reply.
Re: Re: Re:3
Let me stop you there. Citing a dictionary only tells you how its writers have observed a word being used. It is not prescriptive of how you must use a word, nor is it a perfect guide to how someone used a word in a particular instance.
Note that you said “someone” was right, when Mike had said “that,” clearly referring not to the person but to the point he perceived was being made. And you can definitely say “that” is right when referring only to a portion of a longer sentence or thought.
No…? Some writers try to be concise so they don’t always reuse a phrase in the same sentence. “And” here suggests “considered to be” is distributed in front of each group. That’s how I would have written it.
It doesn’t really matter which, and those aren’t mutually exclusive scenarios. It’s clear the quotes mean the speaker is distancing themselves from the judgment, disavowing it as their own words.
Welcome to human communication.
I didn’t read that as them speaking for Mike. They didn’t say, “Mike believes…” I’d attribute that position entirely to the anonymous commenter. Mike might agree, but I wouldn’t assume so until he says so or you dig into past articles where he has spoken of it. Also, I’d expect Mike to be more nuanced on the topic than is being represented. I’ve been reading his writing for 15+ years.
You seem unwilling to accept Mike’s own words while accepting an anonymous person’s even more imprecise words. Two people out of a sample of hundreds or thousands of possible readers isn’t a consensus.
Re:
Greed.
OpenAI doesn’t want to archive the web for future generations or to give away free knowledge. Quite the opposite. It needs a lot of money to exist, and pretty much all of it comes from investors who will soon want their money back.
Re:
Mike clearly doesn’t think AI companies are bad guys, even if he might mildly disagree with some of their behaviors. This is plain as day in his relevant posts.
I, though, do. So my question is: why don’t you?
I remember one of the pieces of advice I was given was that I had to put up a paywall on my news site. I was told that if people wanted your content, then they would pay for it. Otherwise, my website would always be unprofitable.
I rejected that because I knew that news is always going to be a public good that needs to be accessible to all, not just for people who have a lot of cash to burn.
Heck, at one point, someone told me that I’m part of the problem of people expecting news to be free because I allow free access. I responded by pointing out that this is a conscious decision because I believed that news is a public good that all should benefit from.
Now, I’m sure those same people are telling others that AI is going to scrape your content as well if you don’t put up a paywall.
In short, business types are putting a LOT of pressure on news websites to paywall EVERYTHING. It seems that they are, sadly, starting to succeed with others.
Yet the same organizations will quickly fire as many producers as possible and replace them with the AI they seek to block. Make it make sense.
I started noticing this very concerning trend a while ago, and it’s discouraging that it seems to be spreading. Archival for many sites now rests entirely on the shoulders of archive.today, a “rogue” scraper that valiantly ignores exclusion requests and works around paywalls and loginwalls. Unfortunately, they are not necessarily a good actor. Nobody knows for sure who runs it, and they recently DDOS’d a blogger who covered them, probably as an overreaction to legal threats.
And of course, just like IA, archive.today could disappear at any moment. The only way forward for internet archival is to deprecate the centralized repository approach and move to a massively distributed one. Anna’s Archive lights the path here: even if they get taken down by the feds, the archive they’ve built will still be available on countless datahoarder hard drives and will probably be available through a pretty user interface again within a few weeks. We desperately need IA and Archive.today to make an effort to mirror their data in a similar way.
Re:
“overreaction to legal threats”
I wouldn’t be surprised if it was an overreaction.
But man.
Re:
Also, I say man, because it’s sad that either of the two could disappear at any moment.
Re:
Sort of.
The legal threats here were the bogus ones that the archive.today owner launched at the blogger, which they then unilaterally escalated into an illegal DDoS when the blogger didn’t immediately do what they wanted.
Re:
It seems the dispute started when the blogger in question attempted to doxx archive.today’s operator, which is bad, and then said operator resorted to rather questionable methods in retaliation, which made things worse.
Meanwhile, Internet Archive seems to be responding to this situation by rolling out some extremely aggressive new anti-bot measures, presumably to block AI scrapers. Unfortunately they’ve gone way over the line themselves: in the past 48 hours I’ve been blocked by them for escalating intervals for viewing just a handful of pages. Today I was able to view two (count them, two) articles before being blocked for 15 minutes, after which it let me view one (1) more and then blocked me indefinitely. No captcha or other way to get back in quickly, either — when this new block is triggered it just ignores all traffic from your IP address for some interval of time, which can be up to (at least) two full hours.
Needless to say their new restrictions make any serious research using them impossible. Human (not bot) users are now only able to make a handful of queries a day. That is not sustainable and they will have to better tune their new bot-blocker if they don’t want this to turn into a massive self-own …
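For what it’s worth, the escalating behavior described above can be sketched as a per-IP penalty-box scheme. Everything below is a guess at the general pattern, not the Internet Archive’s actual implementation; the class name, thresholds, and intervals are all hypothetical:

```python
import time

# Hypothetical sketch of escalating per-IP blocking. The thresholds and
# intervals are illustrative guesses, not the Internet Archive's actual
# anti-bot configuration.
class EscalatingBlocker:
    def __init__(self, allowed_requests=2, base_block=15 * 60,
                 factor=8, max_block=2 * 60 * 60):
        self.allowed = allowed_requests   # requests permitted before a block
        self.base_block = base_block      # first block length, in seconds
        self.factor = factor              # multiplier per repeat offense
        self.max_block = max_block        # ceiling on block length
        self.state = {}                   # ip -> (count, blocked_until, strikes)

    def check(self, ip, now=None):
        """Return True if the request is allowed, False if the IP is blocked."""
        now = time.time() if now is None else now
        count, blocked_until, strikes = self.state.get(ip, (0, 0.0, 0))
        if now < blocked_until:
            return False                  # still inside the penalty box
        count += 1
        if count > self.allowed:
            # Each strike multiplies the block interval, up to the ceiling.
            block = min(self.base_block * self.factor ** strikes, self.max_block)
            self.state[ip] = (0, now + block, strikes + 1)
            return False
        self.state[ip] = (count, blocked_until, strikes)
        return True
```

With these illustrative settings, a client gets a couple of requests, then a 15-minute block that lengthens on each repeat offense up to a two-hour ceiling, which is roughly the pattern described above. Note that a scheme like this keys on IP address alone, which is exactly why it punishes human readers as readily as bots.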
Two problems
One, I feel the blocking would’ve happened with or without AI.
“We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web.”
Earlier
“I don’t have a clean answer here.”
No offense, but the problem? Even I have a feeling those new companies don’t know.
And another thing: how do you guarantee that approach doesn’t get abused, considering AI crawlers won’t hesitate to abuse it or find a way around it?
(Not against AI, just wanting to know.)
The problem with news sites blocking archiving is the same problem with paywalls. If the news is actually important, it needs to be widespread and remembered. So you’re saying that what you report isn’t important or at least it’s better to leave society ignorant if they can’t pay you for it. If it is important, hiding it is as good as never having written it. People talk about news agencies being the Fourth Estate and a guardian against corruption, but this makes them worse than useless. It makes them collaborators in the dumbing down of society, which only helps authoritarians. And I’m not saying news agencies shouldn’t get paid for their work. I’m just saying the paywall method is contrary to the nature of reporting the news that is actually relevant to the public. Paywall the sports section.
Re:
I completely agree. It’s why I personally refuse to put up a paywall on my site in the first place. I see a lot of news organizations paywalling their content, and the well-off smugly talking about how happily they read this stuff while everyone else does without, thanks to the growing economic mess we’re in. I see the damage done to society, and I said, “Nope, not contributing to that rotten trend.”
I also personally think Techdirt is on the same page on this one as well.
This is underplaying the threat a bit. We know that AI companies are using the Wayback Machine: there is evidence that the Wayback Machine, generally speaking, has been used to train LLMs in the past (AI companies were also caught doing it in the Reddit example). Yes, we haven’t caught anyone scraping The Guardian specifically, but it’s not really a hypothetical, either. Not that it matters; not closing an obvious vulnerability until it’s actually exploited would be negligent.
Companies can whitelist things like assistive software, if they know it won’t be sending the data anywhere. The problem with things like your AI tool is that companies want to use them to scrape. The reason AI companies are so horny about having a browser is they can use your legitimate useragent to scrape, and it can’t be stopped, because you are legitimate traffic. It’s an unblockable Trojan horse.
I don’t know that it can, as long as you have significant actors acting in bad faith. The open web is a commons that requires people to be (mostly) good stewards and not piss in the pool. Avoiding a tragedy of the commons requires either restraining them or finding a way for them to contribute back for their usage.
Re:
What you call a problem merely seems like a fact to me. I’m still not really clear why that should be considered a significant threat. If a news company can’t compete with a computer auto-generating stuff, how can they possibly compete with other news companies who employ talented humans?
Much of the best-remembered reporting, such as Woodward and Bernstein’s Watergate coverage, was based on non-public sources, which these crawlers couldn’t ever find anyway (except if people are giving them highly confidential data, as the current U.S. federal government might just be dumb enough to do). They weren’t regurgitating already-published data; they were doing hard work.
And extracting text from images was one of the first problems these companies solved. So… with investors throwing obscene amounts of money at them, why don’t they just rent a post office box and pay the $20/month or whatever it costs to have physical copies delivered? Hell, they likely have enough cash to buy some recycling companies and scan all the paper that comes in.
This all seems like a classic moral panic to me. People feel like they have to “do something”, nevermind whether it works or whether the threat is even realistic.
Re: Re:
Moral panics and stupid reactions can be based on real threats.
Why can’t a news site compete with AI slop? Because humans are fucking stupid.
People vs AI
News organizations:
“We don’t want their AI looking at the content we pay people to create. We want people to pay us to look at the slop our AI creates.”
They’re blocking the Internet Archive cuz when they publish some real whopper bullshit, and get caught, they don’t want the evidence of that being kept.
MSM lies, a lot. It has nothing to do with AI. What’s the AI going to do, copy their “style”?
Words are abundant. Trust isn’t.
There’s a hard truth that legacy news orgs still haven’t fully accepted: in the digital age, the literal words on a page aren’t the scarce asset anymore.
Distribution is infinite. Copy replicable. Archives can be duplicated in seconds. The value isn’t the static article, it’s the trust, authority, relationship, and ecosystem around it.
For the most part, legacy news orgs have done a pretty phenomenal job of burning through that currency themselves.
Sensationalism. Blurred lines between reporting and commentary. Chasing clicks over clarity. When trust erodes, the words just don’t carry the same weight.
Historically, news orgs benefited from scarcity. Limited printing presses. Limited broadcast channels. Limited shelf space. In that world, control of content equaled power.
What’s scarce today is attention, trust, and clarity. Not information. Discernment. Not access. Credibility.
You can block crawlers. Restrict archives. Try to lock down text. But none of that brings back what was lost.
Words are abundant. Trust isn’t.
“We believe in the value of The New York Times’s human-led journalism…” lol we’ll see how long that lasts, if it isn’t untrue already.
Kudos for highlighting this new threat to our historical knowledge. I don’t think it can be stressed enough how important The Internet Archive is as a keeper of the historical record.
I would offer a solution that respects the publishers’ paywall, preserves the historical record, and maybe even generates a small but desirable revenue stream for the Archive: Establish an agreement whereby the Archive is permitted to archive their content BUT LIMIT ACCESS according to the publishers’ wishes. Anyone wanting access should be required to pay a fee, with a base amount going to the publisher, and perhaps an additional fee going to support the Archive.
What do you think? I can’t imagine it’s too technically difficult to implement, though I understand the management burden could be substantial (thus the fee for the Archive to pay for it).