340 Local News Outlets Now Blocking The Internet Archive

from the history-is-now-a-black-hole dept

Earlier this year, Nieman Lab broke the story that major news publishers, including The New York Times, The Guardian, and USA Today Co., had started blocking the Internet Archive for fear that AI companies might scrape the nonprofit’s repositories for training data. Given that the Internet Archive is one of the last bastions of archival history, that is, in case you’re not aware, not very good for the public interest.

Four months later, Nieman Lab now notes that the number of news outlets blocking the archive has soared to around 340:

“Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The latter two are both subsidiaries of the “vulture hedge fund” Alden Global Capital.”

Many of these localities are already effectively news deserts, where most real local journalism was hollowed out and replaced by a smattering of local right-wing broadcasters (like Sinclair Broadcasting) or a hedge-fund-run “local newspaper” that doesn’t do much in the way of actual local reporting. That’s generally also been terrible for informed consensus or shedding light on local corruption.

Some of the outlets blocking Internet Archive access have legitimate concerns about protecting their hard work from being repackaged and resold without compensation or citation. But an awful lot of the folks grumbling about the Internet Archive were never in the journalism business to serve the public interest in the first place.

Regardless of motivation, hiding whatever local news remains behind paywalls, then blocking it from the Internet Archive, in turn makes it harder for everyone else to do real journalism that relies on the historical record, local journalists tell Nieman Lab:

“I cover news within a larger news desert in New York’s Rockland, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets,” wrote B.J. Mendelson, the editor of The Monroe Gazette newsletter, in one recent petition signed by over 200 journalists. “Without the Internet Archive, my [work] would be incredibly difficult to do.”

Trying to address publisher concerns, the folks at the Wayback Machine have highlighted ongoing efforts to minimize abuse of the site, including restrictions on bulk downloading and collaborating with Cloudflare to monitor bot activity.

But even beyond AI scraping, many corporate media owners simply can’t see beyond the narrow interests of paywalled revenue. And corporate power and authoritarianism, sometimes working in collaboration, both tend to benefit from a misinformed electorate that doesn’t have a firm grip on the lessons learned from historical experience, and doesn’t have easy access to the factual record.

I’ve been a journalist for several decades, and the vast, vast majority of my work has been deleted by website owners and companies that simply couldn’t have cared any less about archival history or any sort of permanent record. My explorations of telecom policy have disappeared, but Verizon, AT&T, and Comcast’s version of the historical record generally remains. You can probably see how that’s of benefit to corporate power.

But again, smaller, independent, local news outlets on fixed budgets have particularly legitimate concerns about the tech giants’ plan to hijack and repackage the entirety of their work using AI without any compensation or attribution whatsoever. The Internet Archive folks say they are listening to those concerns, while also trying to train news orgs on archival preservation:

“In December, the Internet Archive partnered with the Poynter Institute and Investigative Reporters and Editors to train a cohort of 33 local and national news outlets on how to develop and implement an archiving strategy. The initiative, funded through a Press Forward grant, aims to train 300 newsrooms in digital preservation and in using the Internet Archive’s services by the end of 2027.”

Some other archival efforts exist, but they often involve paywalled access. That’s again a problem when you’ve got an authoritarian corporate coalition driven heavily by free propaganda, while factual reality and what’s left of intelligent U.S. analysis and journalism sit hidden behind a monthly subscription fee.

Companies: advance media, gannett, internet archive, mcclatchy, medianews

Comments on “340 Local News Outlets Now Blocking The Internet Archive”

5 Comments
Anonymous Coward says:

We run an archive

We’ve been forced to put a wall in front of it — a free one, but still a wall — because the web crawlers operated by AI companies not only took everything regardless of permissions/copyright, but they kept hitting it thousands of times at high speed from locations all over the Internet/world, thus creating a DDoS attack.

If you’re about to write “why didn’t you…?” I know. I’m intimately familiar with defenses against attacks and abuse, including sharing co-credit for inventing one. I know pretty much every possible way to defend an online operation and I know pretty much everything about those methods. The way we chose was the last option we wanted, we did everything possible to avoid using it — at considerable trouble and expense — but it’s the only way that works.

So don’t blame the IA. Put the blame squarely where it belongs: on sociopathic assholes like Sam Altman and Mark Zuckerberg and Elon Musk et al.

And by the way: this isn’t an accident. They want this, because every free archive, every unpaywalled news operation, every open web site, is giving away for free what they want to charge for.

Anonymous Coward says:

I noticed yesterday that NYT’s hostility to users is now so bad that archive.is can’t capture it reliably. Archive.org is into legality and respectability and Archive.is is run by one single Russian asshole who can’t keep up. We need a new, rogue archival project that leverages residential IPs. If only that were in Anna’s remit.

Arianity (profile) says:

Trying to address publisher concerns, the folks at the Wayback Machine have highlighted ongoing efforts to minimize abuse of the site, including restrictions on bulk downloading and collaborating with Cloudflare to monitor bot activity.

Unfortunately, it doesn’t seem to be as effective as they’d like to make it out to be. :/

And I’m not really sure how you fix it, AI companies pissing in the commons seems to have ruined it for everyone.
