Preserving The Web Is Not The Problem. Losing It Is.
from the libraries-matter dept
Recent reporting by Nieman Lab describes how some major news organizations—including The Guardian, The New York Times, and Reddit—are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. As stated in the article, these organizations are blocking access largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.
These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.
The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository library, has been building its archive of the world wide web since 1996. Today, the Wayback Machine provides access to thirty years’ worth of web history and culture. It has become an essential resource for journalists, researchers, courts, and the public.
For three decades the Wayback Machine has peacefully coexisted with the development of the web, including the websites mentioned in the article. Our mission is simple: to preserve knowledge and make it accessible for research, accountability, and historical understanding.
As tech policy writer Mike Masnick recently warned, blocking preservation efforts risks a profound unintended consequence: “significant chunks of our journalistic record and historical cultural context simply… disappear.” He notes that when trusted publications are absent from archives, we risk creating a historical record biased against quality journalism.
There is no question that generative AI has changed the landscape of the world wide web. But it is important to be clear about what the Wayback Machine is, and what it is not.
The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.
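The rate limiting mentioned above can be sketched as a per-client token bucket, a common throttling technique. This is a hypothetical illustration only; the Wayback Machine's actual implementation is not public, and the class and parameter names here are invented for the example:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests/second per client, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)   # per-client token balance
        self.last = defaultdict(time.monotonic)       # last refill timestamp

    def allow(self, client: str) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens[client] = min(
            self.capacity,
            self.tokens[client] + (now - self.last[client]) * self.rate,
        )
        self.last[client] = now
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False  # client has exhausted its budget; throttle this request

# A burst within the budget is served; the request after it is throttled.
bucket = TokenBucket(rate=1.0, capacity=2.0)
print(bucket.allow("198.51.100.7"))  # first request in burst
print(bucket.allow("198.51.100.7"))  # second request in burst
print(bucket.allow("198.51.100.7"))  # immediate third request is denied
```

In practice a scheme like this would sit behind other signals (user-agent filtering, behavioral monitoring), but it captures the basic idea of serving human readers while slowing bulk scrapers.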
We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.
What concerns me most is the unintended consequence of these blocks. When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.
Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.
Mark Graham is the Director of the Wayback Machine at the Internet Archive
Filed Under: ai, archives, journalism, libraries, preserving history, scraping, wayback machine
Companies: internet archive


Comments on “Preserving The Web Is Not The Problem. Losing It Is.”
I have used the Wayback Machine to recover lost versions of my own websites, to track how far back a policy has been in place at an employer after the institutional knowledge has retired, and to verify publication dates on news topics to combat disinformation about what was known when, among hundreds of other beneficial uses.
It’s exceptionally shortsighted to block archiving, unless you have something to hide. It’s like throwing out the masters of famous recordings or putting a Picasso on the curb in the rain. You’re destroying history. We’ve seen companies change owners and whole publications and archives just disappear, on a whim, out of spite, or through ignorance.
It would be very dumb if the WBM ever allowed AI crawlers, and a lose-lose situation.
Backdoor access for AI scraping is one thing, but bandwidth is another problem: even Wikipedia could not tolerate free AI crawling.
Not to mention, removing old pages that have already been saved (not to be confused with blocking the saving of newer articles) can also be detrimental to the news site that requested it. When you nuke web pages you previously allowed to be archived, you cut off the sources Wikipedia relies on (Wikipedia citations point to a deleted article that’s no longer accessible on the WBM). And now even more of the links to your webpages come from Google, which is already hurting your traffic with its AI Overviews.
Either way, this puts news sites in an awful lose-lose position:
* Rely on Google for traffic? Hope that look-alike results don’t bury you.
* Rely on ads? Either rely on ads controlled by Google, settle for less-intrusive ads with little payout, or resort to some of the nastiest inventories: more intrusive, more disruptive, and a step closer to the kind of ads you find on adult sites, “sail the high seas” sites, file hosts, and link shorteners. You’ll risk having users install a certain browser extension, and hope you don’t get into a vicious cycle of adblockers vs. ads.
* Paywalls? Most people don’t pay much.
* If AI-generated content becomes so rampant that we cannot tell whether content is genuine, people may just “not trust any news,” further killing traffic.
It’s terrible if news sites die and take their archived articles down with them.
Word.
One of the other things is that newspapers should already be preserving their own morgues, like they used to, and leaving them publicly accessible. And if they provided for someone else to archive them when they inevitably cash out (idk, like letting IA pick up their storage), that would be great too. But so far, they let someone buy them out who just burns the history, on purpose, like Hollywood did with film from day one.
It’s a pretty sick mindset.
While it may not be intended, I don’t see how you can say it’s unfounded when it is actively being used that way. The IA has already appeared in AI datasets. I appreciate you’re doing your best, but I don’t see how you can credibly claim this is an unfounded concern, or promise it will stop. It is not possible to completely stop scraping, only mitigate it.
Also, notably:
Currently, however, the Internet Archive does not disallow any specific crawlers through its robots.txt file, including those of major AI companies. As of January 12, the robots.txt file for archive.org read: “Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!”…The Internet Archive blocked the hosts twice before putting out a public call to “respectfully” scrape its site.…“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.”…“Those wanting to use our materials in bulk should start slowly, and ramp up,” wrote Kahle in a blog post shortly after the incident. “Also, if you are starting a large project please contact us …we are here to help.”
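The robots.txt mechanism quoted above is purely advisory: well-behaved crawlers check it before fetching, but nothing forces them to. Python’s standard library can parse such a file; the rules below are a hypothetical example of a policy that blocks one named bot, not archive.org’s actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, welcome everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults can_fetch() before every request.
print(parser.can_fetch("ExampleAIBot", "/web/2020/example"))  # the blocked bot
print(parser.can_fetch("SomeOtherAgent", "/web/2020/example"))  # everyone else
```

Compliance is voluntary, which is exactly why the quote above pairs the permissive robots.txt with rate limiting and direct outreach to heavy scrapers.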
Re:
The “open” web (the part that can be accessed without paying anything) was not enough to train the largest LLMs of two years ago. So not only are Wikipedia (some of the best human-created content), Common Crawl, and the Internet Archive (with some 1,000 billion webpages saved) the bedrock (or, as Microsoft said when training its AI, “freeware,” as in just grab it for free), but even all of that is not enough to build a decent LLM.
A fair share of spending by AI companies goes toward building indexing engines and spawning armies of crawlers to gather as much content as needed.
So yes, removing the Internet Archive may reduce the amount of training content only a bit (the content is mostly old), but it will certainly degrade the overall outputs of LLMs (Reddit pages may be another matter), and these AI companies will spend even more to keep outputs at a decent quality, while hallucinations rise sharply.
Re:
Every time someone says anything even vaguely critical of the Internet Archive they come out with one of these plaintive, wounded moans of “but we’re the good guys” rather than actually engaging with the substance of the criticism. Deeply annoying, but it seems to be working for them.
One of the axioms I’ve heard over the years is “the Internet never forgets”. The problem with that axiom is that it’s not really true. It may be applicable for really popular content for a few years, but the internet does, in fact, forget. Digital rot is a very real thing.
Had I not preserved a whole stack of articles on my site, I’m willing to bet some of those articles would’ve simply disappeared from the web completely. I had a heck of a hard time finding some of them so I could repost them. Some were only available in an archived post on the Wayback Machine. Others were still lingering in Google’s cache. Some were only available on the other website while it was still alive (neither is alive anymore). Still, I know some are probably lost forever because I didn’t think of archiving everything I wrote when I was first writing news. I had the wrong mindset that the articles would always be there in some form or another. A really big mistake that I have since rectified.
I use archive.org for my own website
Due to running unsupported gallery software for seven years longer than I should have, it eventually got hacked. I transferred the photos over to Zenphoto, but the captions did not come with them. My attempts to even get command-line access on the new host failed. archive.org has been very helpful for redoing the captions. Only another 40,000 captions to redo by hand!
They really aren’t. The only reasonable concern here is scrapers overloading their server, which obviously is not an issue for them when it’s IA’s server.
If they just don’t want anyone or anything to learn from them without forking over cash, they can get fucked.
FYI, the Wayback Machine has been extremely unreliable lately, with rampant 503 errors.