Preserving The Web Is Not The Problem. Losing It Is.
from the libraries-matter dept
Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. As stated in the article, these organizations are blocking access largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.
These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.
The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository library, has been building its archive of the world wide web since 1996. Today, the Wayback Machine provides access to thirty years’ worth of web history and culture. It has become an essential resource for journalists, researchers, courts, and the public.
For three decades the Wayback Machine has peacefully coexisted with the development of the web, including the websites mentioned in the article. Our mission is simple: to preserve knowledge and make it accessible for research, accountability, and historical understanding.
As tech policy writer Mike Masnick recently warned, blocking preservation efforts risks a profound unintended consequence: “significant chunks of our journalistic record and historical cultural context simply… disappear.” He notes that when trusted publications are absent from archives, we risk creating a historical record biased against quality journalism.
There is no question that generative AI has changed the landscape of the world wide web. But it is important to be clear about what the Wayback Machine is, and what it is not.
The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.
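The rate limiting mentioned above can be sketched as a per-client token bucket, a common throttling technique. This is a hypothetical illustration only; the Wayback Machine's actual implementation is not public, and the class and parameter names here are invented for the example:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests/second per client, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)   # per-client token balance
        self.last = defaultdict(time.monotonic)       # last refill timestamp

    def allow(self, client: str) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens[client] = min(
            self.capacity,
            self.tokens[client] + (now - self.last[client]) * self.rate,
        )
        self.last[client] = now
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False  # client has exhausted its budget; throttle this request

# A burst within the budget is served; the request after it is throttled.
bucket = TokenBucket(rate=1.0, capacity=2.0)
print(bucket.allow("198.51.100.7"))  # first request in burst
print(bucket.allow("198.51.100.7"))  # second request in burst
print(bucket.allow("198.51.100.7"))  # immediate third request is denied
```

In practice a scheme like this would sit behind other signals (user-agent filtering, behavioral monitoring), but it captures the basic idea of serving human readers while slowing bulk scrapers.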
We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.
What concerns me most is the unintended consequence of these blocks. When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.
Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.
Mark Graham is the Director of the Wayback Machine at the Internet Archive
Filed Under: ai, archives, journalism, libraries, preserving history, scraping, wayback machine
Companies: internet archive


Comments on “Preserving The Web Is Not The Problem. Losing It Is.”
I have used the Wayback Machine to recover lost versions of my own websites, to track how far back a policy has been in place at an employer after the institutional knowledge has retired, and to verify publication dates on news topics to combat disinformation about what was known when, among hundreds of other beneficial uses.
It’s exceptionally shortsighted to block archiving, unless you have something to hide. It’s like throwing out the masters of famous recordings or putting a Picasso on the curb in the rain. You’re destroying history. We’ve seen companies change owners and whole publications and archives just disappear, on a whim, out of spite, or through ignorance.
It would be very dumb if the WBM ever allowed AI crawlers, and a lose-lose situation.
Backdoor access for AI scraping is one thing, but bandwidth is another problem: even Wikipedia could not tolerate free AI crawling.
Not to mention, removing old pages that have already been saved (not to be confused with blocking the saving of newer articles) can also be detrimental to the news site that requested it. When you nuke web pages you previously allowed to be archived, you cut off the sources Wikipedia relies on (Wikipedia citations point to a deleted article that’s no longer accessible on the WBM). And now even more of the links to your webpages come from Google, which is already hurting your traffic with its AI Overviews.
Either way, this puts news sites in an awful lose-lose position:
* Rely on Google for traffic? Hope that look-alike results don’t bury you.
* Rely on ads? Either rely on ads controlled by Google, settle for less-intrusive ads with little payout, or resort to some of the nastiest inventories: more intrusive, more disruptive, and a step closer to the kind of ads you find on adult sites, “sail the high seas” sites, file hosts, and link shorteners. You’ll risk having users install a certain browser extension, and hope you don’t get into a vicious cycle of adblockers vs. ads.
* Paywalls? Most people don’t pay much.
* If AI-generated content becomes so rampant that we cannot tell whether content is genuine, people may just “not trust any news,” further killing traffic.
It’s terrible if news sites die and take their archived articles down with them.
Word.
One of the other things is that newspapers should already be preserving their own morgues, like they used to, and leaving them publicly accessible. And if they provided for someone else to archive them when they inevitably cash out (idk, like letting IA pick up their storage), that would be great too. But so far, they let someone buy them out who just burns the history, on purpose, like Hollywood did with film from day one.
It’s a pretty sick mindset.
While it may not be intended, I don’t see how you can say it’s unfounded when it is actively being used that way. The IA has already appeared in AI datasets. I appreciate you’re doing your best, but I don’t see how you can credibly claim this is an unfounded concern, or promise it will stop. It is not possible to completely stop scraping, only mitigate it.
Also, notably:
Currently, however, the Internet Archive does not disallow any specific crawlers through its robots.txt file, including those of major AI companies. As of January 12, the robots.txt file for archive.org read: “Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!”…The Internet Archive blocked the hosts twice before putting out a public call to “respectfully” scrape its site.…“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.”…“Those wanting to use our materials in bulk should start slowly, and ramp up,” wrote Kahle in a blog post shortly after the incident. “Also, if you are starting a large project please contact us …we are here to help.”
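The robots.txt mechanism quoted above is purely advisory: well-behaved crawlers check it before fetching, but nothing forces them to. Python’s standard library can parse such a file; the rules below are a hypothetical example of a policy that blocks one named bot, not archive.org’s actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, welcome everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults can_fetch() before every request.
print(parser.can_fetch("ExampleAIBot", "/web/2020/example"))  # the blocked bot
print(parser.can_fetch("SomeOtherAgent", "/web/2020/example"))  # everyone else
```

Compliance is voluntary, which is exactly why the quote above pairs the permissive robots.txt with rate limiting and direct outreach to heavy scrapers.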
Re:
The “open” web (the part that can be accessed without paying anything) was not enough to train the largest LLMs of two years ago. So not only are Wikipedia (some of the best human-created content), Common Crawl, and the Internet Archive (with some 1,000 billion webpages saved) the bedrock (or, as Microsoft said when training its AI, “freeware,” as in just grab it for free), but even all of that is not enough to build a decent LLM.
A fair share of spending by AI companies goes toward building indexing engines and spawning armies of crawlers to gather as much content as needed.
So yes, removing the Internet Archive may reduce the amount of training content only a bit (the content is mostly old), but it will certainly degrade the overall outputs of LLMs (Reddit pages may be another matter), and these AI companies will spend even more to keep outputs at a decent quality, while hallucinations rise sharply.
Re:
Every time someone says anything even vaguely critical of the Internet Archive they come out with one of these plaintive, wounded moans of “but we’re the good guys” rather than actually engaging with the substance of the criticism. Deeply annoying, but it seems to be working for them.
One of the axioms I’ve heard over the years is “the Internet never forgets”. The problem with that axiom is that it’s not really true. It may be applicable for really popular content for a few years, but the internet does, in fact, forget. Digital rot is a very real thing.
Had I not preserved a whole stack of articles on my site, I’m willing to bet some of those articles would’ve simply disappeared from the web completely. I had a heck of a hard time finding some of them so I could repost them. Some were only available in an archived post on the Wayback Machine. Others were still lingering in Google’s cache. Some were only available on the other website while it was still alive (neither is alive anymore). Still, I know some are probably lost forever because I didn’t think of archiving everything I wrote when I was first writing news. I had the wrong mindset that the articles would always be there in some form or another. A really big mistake that I have since rectified.
I use archive.org for my own website
Due to running unsupported gallery software for seven years longer than I should have, it eventually got hacked. I transferred the photos over to Zenphoto, but the captions did not come with them. My attempts to even get command-line access on the new host failed. archive.org has been very helpful for redoing the captions. Only another 40,000 captions to redo by hand!
They really aren’t. The only reasonable concern here is scrapers overloading their server, which obviously is not an issue for them when it’s IA’s server.
If they just don’t want anyone or anything to learn from them without forking over cash, they can get fucked.
FYI, the Wayback Machine has been extremely unreliable lately, with rampant 503 errors.