Preserving The Web Is Not The Problem. Losing It Is.

from the libraries-matter dept

Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. As stated in the article, these organizations are blocking access largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.

These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.

The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository library, has been building its archive of the world wide web since 1996. Today, the Wayback Machine provides access to thirty years’ worth of web history and culture. It has become an essential resource for journalists, researchers, courts, and the public. 

For three decades the Wayback Machine has peacefully coexisted with the development of the web, including the websites mentioned in the article. Our mission is simple: to preserve knowledge and make it accessible for research, accountability, and historical understanding. 

As tech policy writer Mike Masnick recently warned, blocking preservation efforts risks a profound unintended consequence: “significant chunks of our journalistic record and historical cultural context simply… disappear.” He notes that when trusted publications are absent from archives, we risk creating a historical record biased against quality journalism.

There is no question that generative AI has changed the landscape of the world wide web. But it is important to be clear about what the Wayback Machine is, and what it is not.

The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.
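The rate limiting mentioned above can be sketched in a few lines. This is a generic token-bucket illustration of the technique, not the Archive's actual implementation; the class name, capacity, and refill rate are invented for the example:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: each client gets a burst of
    `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.buckets = {}  # client_id -> (tokens_remaining, last_seen)

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True  # request allowed
        self.buckets[client_id] = (tokens, now)
        return False     # request throttled

limiter = TokenBucket(capacity=3, rate=1.0)  # 3-request burst, 1 req/sec refill
results = [limiter.allow("1.2.3.4", now=t) for t in (0, 0, 0, 0)]
# The first three calls pass on the burst allowance; the fourth is throttled.
```

A human reader browsing archived pages never exhausts a budget like this, while a bulk scraper hits the ceiling almost immediately; that asymmetry is what makes rate limiting a useful first line of defense.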

We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.

What concerns me most is the unintended consequence of these blocks. When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.

Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.

Mark Graham is the Director of the Wayback Machine at the Internet Archive

Companies: internet archive


Comments on “Preserving The Web Is Not The Problem. Losing It Is.”

4 Comments
MrWilson (profile) says:

I have used the Wayback Machine to recover lost versions of my own websites, track how far back a policy has been in place at an employer after the institutional knowledge has retired, and verify publication dates on news topics to combat disinformation about what was known when, among hundreds of other beneficial uses.

It’s exceptionally shortsighted to block archiving, unless you have something to hide. It’s like throwing out the masters of famous recordings or putting a Picasso on the curb in the rain. You’re destroying history. We’ve seen companies change owners and whole publications and archives just disappear on a whim, out of spite, or through ignorance.

GHB (profile) says:

It would be very dumb if the WBM ever allowed AI crawlers; it’s a lose-lose situation.

Backdoor access for AI scraping is one thing, but bandwidth is another problem: even Wikipedia could not tolerate free AI crawling.

Not to mention, excluding old pages that have already been saved (not to be confused with blocking newer articles from being saved) can also be detrimental to the news site that requested it. When you nuke web pages you previously allowed to be archived, you cut off what Wikipedia relies on for sources (a Wikipedia citation now points to a deleted article that’s no longer accessible on the WBM). And then even more of the links to your webpages come from Google, which is already hurting your traffic with its AI Overviews.

Either way, this puts news sites in an awful lose-lose position:
* Rely on Google for traffic? Hope that look-alike results don’t bury you.
* Rely on ads? Either accept ads controlled by Google, run less-intrusive ads with little payout, or resort to some of the nastiest inventories: more intrusive, more disruptive, and a step closer to the type of ads you find on adult sites, “sail the high seas” sites, file hosts, and link shorteners. You’ll risk users installing a certain browser extension, and hope you don’t get into a vicious cycle of ad blockers vs. ads.
* Paywalls? Most people won’t pay much.
* And if AI-generated content becomes so rampant that we cannot tell whether content is genuine, people may just “not trust any news,” further killing traffic.

It’s terrible if news sites die and take their archived articles down with them.

Anonymous Coward says:

Word.

One of the other things is that newspapers should already be preserving their own morgues, like they used to, and leaving them publicly accessible. And if they provided for someone else to archive it all when they inevitably cash out (say, letting IA pick up their storage), that would be great too. But so far, they let someone buy them out who just burns the history, on purpose, like Hollywood did with film from day one.

It’s a pretty sick mindset.

Arianity (profile) says:

These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse.

While it may not be intended, I don’t see how you can say it’s unfounded when it is actively being used that way. The IA has already appeared in AI datasets. I appreciate you’re doing your best, but I don’t see how you can credibly claim this is an unfounded concern, or promise it will stop. It is not possible to completely stop scraping, only mitigate it.

Also, notably:

Currently, however, the Internet Archive does not disallow any specific crawlers through its robots.txt file, including those of major AI companies. As of January 12, the robots.txt file for archive.org read: “Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!”

The Internet Archive blocked the hosts twice before putting out a public call to “respectfully” scrape its site. …

“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.” …

“Those wanting to use our materials in bulk should start slowly, and ramp up,” wrote Kahle in a blog post shortly after the incident. “Also, if you are starting a large project please contact us … we are here to help.”
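For contrast with the open robots.txt quoted above: if a site operator ever did want to turn away known AI-training crawlers while staying open to everyone else, the file could look something like the sketch below. The user-agent tokens shown are the publicly documented ones for OpenAI, Common Crawl, and Google’s AI-training crawler, and honoring robots.txt is entirely voluntary on the crawler’s part:

```text
# Turn away known AI-training crawlers
# (compliance with robots.txt is voluntary)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else remains welcome (empty Disallow = allow all)
User-agent: *
Disallow:
```

Because compliance is voluntary, directives like these are a statement of policy, not an enforcement mechanism; that is why the rate limiting and filtering discussed in the post still matter.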
