We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else
from the how-open-is-it? dept
A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what he was seeing. Across the tech policy world, people who spent decades fighting for an open, accessible internet are now cheering as that same internet gets locked down, walled off, and restricted. Their reasoning? If it hurts AI companies, it must be good.
This is a profound mistake that threatens the very principles these advocates once championed.
There are plenty of reasons to be concerned about LLM/AI tools these days, in terms of how they can be overhyped, how they can be misused, and certainly over who has power and control over the systems. But it’s deeply concerning to me how many people who supported an open internet and the fundamental principles that underlie that have now given up on those principles because they see that some AI companies might benefit from an open internet.
The problem isn’t just ideological—it’s practical. We’re watching the construction of a fundamentally different internet, one where access is controlled by gatekeepers and paywalls rather than governed by open protocols and user choice. And we’re doing it in the name of stopping AI companies, even though the real result will be to concentrate even more power in the hands of those same large tech companies while making the internet less useful for everyone else.
The move toward a closed internet shifted into high gear, to some extent, with Cloudflare launching its pay-per-crawl feature. I will admit that when I first saw this announcement, it intrigued me. It would sure be nice for Techdirt if we suddenly started getting random checks from AI companies for crawling the more than 80k articles we’ve written that are then fueling their LLMs.
But, also, I recognize that even having 80k high-quality (if I say so myself) articles is probably worth… not very much. LLMs are based on feeding billions of pieces of content—articles, websites, comments, pdfs, videos, books, etc—into a transformer tool to make the LLMs work. Any individual piece of content (or even 80k pieces of content) is actually not worth that much. So, even if Cloudflare’s system got anyone to pay, the net effect for almost everyone online would be… tiny.
Of course, history has also shown that those setting up the tollbooths to be aggregators of such payments often do quite well. So I’m sure Cloudflare might do quite well out of this deal (and, honestly, I would trust Cloudflare to do a better job of this than many other companies, given its history). But the tollbooth/aggregators quite often become corrupt. Research on the history of these kinds of “collective licensing” intermediaries shows a long trail of corruption and other problems.
More concerning than the economic model, though, was what came next. None of this is to suggest Cloudflare will definitely go down the road of corruption, but the temptations will be there. And indeed, a secondary announcement from Cloudflare revealed a fundamental confusion about what kinds of internet access should be restricted. Last month, it accused AI company Perplexity of “using stealth, undeclared crawlers to evade website no-crawl directives.”
Plenty of people reacted angrily to the story, arguing it was proof of bad behavior on Perplexity’s part, but the details suggest that Cloudflare was conflating very different activities. It’s one thing to block scraper bots that are building up an index of content for training an LLM. That’s an area where it seems reasonable for some to choose to block those bots.
But what Cloudflare described was something different entirely:
We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website…. We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
This is where the anti-AI sentiment becomes genuinely dangerous to internet openness. It’s one thing to say “no general scraping bots” but what Cloudflare is describing here is something much more fundamental: they want robots.txt files to control not just automated crawling, but individual user queries. That’s not protecting against bulk AI training—that’s breaking how the web works.
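For context, robots.txt has always been a simple, voluntary convention: a plain-text file of per-crawler directives that well-behaved bots choose to honor. A hypothetical file that opts out of bulk AI training while staying open to everything else might look like this (GPTBot and CCBot are the declared names of OpenAI’s and Common Crawl’s crawlers; the rest is illustrative):

```text
# Hypothetical robots.txt: ask known AI training crawlers to stay out,
# while leaving the site open to all other agents.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

Note what this file cannot express: it matches on self-declared agent names only. There is no directive that says “block bulk indexing but allow a single user-directed fetch,” which is exactly the distinction that gets lost when robots.txt is treated as a gate on individual queries.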
Let me give an example that hopefully clarifies why I find this problematic. A year and a half ago, I wrote about how I use LLM tools at Techdirt to help me with editing. A lot has changed in the 17 months since that was written, but I still use the same tool, Lex, to help me with editing what I write. And one thing I’ve found to be super useful in my final edit is that I give the tool a list of all sources I used in writing the article so that it can fact check (it will also search other sources for me, which is quite useful as it will—with surprising frequency—find useful sources to add more relevant information to an article).
But, increasingly, I’m finding that for certain news sites, it refuses to read them, and I’m guessing it’s because of various lawsuits some publishers have filed. So, for example, I find that the tool I use refuses to read NY Times or NBC News stories. But, I’m not trying to train an AI on those articles. I’m just asking it to read over the article, read over what I’ve written, and give me a sense of whether or not it believes I’m writing a fair assessment based on those articles.
When the AI is able to read that content, I find it incredibly useful in making sure that my reporting is accurate and clear. But there are times I’m unable to, because these publishers have taken such an extreme view of these tools that they seek to block any and all access.
This illustrates the core problem: we’re not just blocking bulk AI training anymore. We’re blocking legitimate individual use of AI tools to access and analyze web content. That’s not protecting creator rights—that’s breaking the fundamental promise of the web that if you publish something publicly, people should be able to access and use it.
Consider the broader implications: if we normalize blocking AI tools from accessing web content, where does it end? We’ve talked in the past about how many visually impaired users rely on technological tools to “read” websites for them. If we establish that all technological intermediary tools can be blocked without payment, we’re not just hurting AI companies—we’re potentially breaking accessibility tools that people depend on.
There’s a world of difference between “scrape this site to add it to a massive corpus of data” and “hey, can you just look at this one site to see what it says?” One is a big scraping job and one is simply a user-directed prompt.
Cloudflare’s complaint against Perplexity seems to conflate the two and pretend they’re the same. And I wasn’t the only one who noticed how odd this is, especially if you believe in an open web. On an open web, if I point a browsing tool at an open website, the tool should be able to read that website.
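The crawler-versus-user distinction can be made concrete in code. Python’s standard library ships a robots.txt parser, and a short sketch shows that the protocol only ever matches on self-declared agent names; it has no vocabulary for distinguishing a one-off user-directed fetch from bulk indexing. (The site and tool names below are made up for illustration; CCBot is Common Crawl’s real crawler name.)

```python
from urllib.robotparser import RobotFileParser

# A hypothetical site's robots.txt: block one bulk training crawler,
# allow everyone else.
rules = """
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A bulk crawler that honors robots.txt would skip the whole site...
print(rp.can_fetch("CCBot", "https://example.com/article"))           # False

# ...while any other agent, including a user-directed browsing tool,
# is permitted. The protocol only knows agent names; it cannot say
# "this fetch is one user's query" vs. "this fetch is bulk indexing."
print(rp.can_fetch("MyBrowsingTool", "https://example.com/article"))  # True
```

That gap is the whole dispute in miniature: whether an AI assistant fetching one page at a user’s request should be treated as the named crawler, or as the user’s own agent.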
The collateral damage from this conflation is already spreading beyond AI companies.
Take, for example, Reddit telling the Internet Archive that it was going to start blocking its crawler from archiving Reddit feeds, because it was worried that AI companies were simply getting access to its content (that Reddit now is looking to license) by going to the Wayback Machine instead.
Here we see the real economic driver behind much of this: Reddit has discovered that user-generated content can be a revenue stream through AI licensing deals. But rather than finding ways to capture that value while preserving archival access, they’re choosing to break historical preservation entirely. We’re losing decades of human discourse and cultural history because Reddit wants to ensure AI companies pay for access to fresh content.
All of this suggests we’re moving very far away from an open internet, and towards one where it’s not just “pay to crawl” but it’s “pay to click” to get access to anything online.
Common Crawl, a non-profit at the center of some of these fights, is finding itself in a tough spot as well. It’s spent many years creating incredibly important and useful archives of the web. Those archives have been essential for many important research projects. But the Common Crawl archives have also been quite useful to LLM companies, and Common Crawl has been trying to navigate all of this. Unlike some others, its scanning bot is quite clear about who it is and seeks to be as “friendly” as a scraping bot can be. It’s not trying to sneak around, yet it’s suddenly facing challenges where it can’t accurately archive large parts of the web any more.
The Common Crawl situation perfectly illustrates how anti-AI sentiment is destroying valuable public resources. Common Crawl has been crucial for academic research, journalism, and public interest projects for over a decade. Researchers have used its archives to study everything from the spread of misinformation to the evolution of web technologies. But because AI companies also found the archives useful, Common Crawl is now being shut out of large parts of the web.
This is the definition of cutting off your nose to spite your face. We’re destroying a public good that benefits researchers, journalists, and civil society because we’re afraid that AI companies might also benefit from it.
And all that means that the web isn’t that open anymore. And that’s sad to think about.
Common Crawl is now suggesting that forward-thinking companies will start to treat enabling open crawling of their websites as an updated form of “search engine optimization,” or, in this case, AI optimization. At least some companies seem to agree: as more searches move from traditional search queries to LLMs, managers want information about their brands, and links to them, to appear in AI search results:
A significant number of websites currently block CCBot (Common Crawl’s web crawler), often without realizing its role in the ML and research ecosystems. Common Crawl publishes monthly web datasets which serve as foundational training data for major AI models and research initiatives.
As SEO practitioner Ash Nallawalla (author of The Accidental SEO Manager) wrote:
“A manager asked me why our leading brand was not mentioned by an AI platform, which mentioned obscure competitors instead. I found that we had been blocking ccBot for some years, because some sites were scraping our content indirectly. After some discussion, we felt that allowing LLM crawlers was more beneficial than the risk of being scraped, so we revised our exclusion list.”
If CCBot can’t crawl your site, your content is absent from one of the key datasets on which AI models are trained, potentially making your brand less visible in AI-powered search results.
This quote reveals the fundamental tension in the current approach. Companies are discovering that blocking AI access doesn’t just prevent training—it makes them invisible in an increasingly AI-mediated web. As Judge Mehta just noted in the Google antitrust remedies ruling, AI is beginning to encroach on the historical search market. As more people use AI tools for search and research, being blocked from AI training datasets means being blocked from discoverability.
We’re creating a two-tier internet: sites that can be found and accessed through modern tools, and sites that can’t. Guess which tier will thrive?
In other words, there is a lot going on across the board here. You have some companies who want to appear in AI results. You have some (including us at Techdirt!) who don’t mind it when AI scanners crawl and learn from our content, so long as they don’t take down our servers.
But, increasingly, we’re seeing people have such a negative, knee-jerk, anti-AI stance that they may be shutting off access to the web in a manner that could lead to the death of an open web, and could lead much more towards a pay-to-access model on the web, which I think is a result that most of us would regret.
And this is what I fear we’re going to end up with: an internet where large platforms control access through licensing deals and technical restrictions, where public archives are neutered to prevent AI companies from accessing them, and where individual users can’t use modern tools to access and analyze web content. It’s a world where Google, Microsoft, and Meta get special access through billion-dollar licensing deals while everyone else—researchers, journalists, small businesses, individual users—gets locked out.
The power and excitement of an open web was that it was open and accessible to all. The web’s core principle wasn’t “open to everyone except the technologies we don’t like.” It was “open, period.” Once we start making exceptions based on who might benefit or what technology might be used to access content, we’ve abandoned that principle entirely.
We’re not protecting creators or preserving the open internet—we’re helping to destroy it. The real winners in this new world won’t be individual writers or small publishers. They’ll be the same large tech companies that can afford licensing deals and that have the resources to navigate an increasingly complex web of access restrictions. The losers will be everyone else: users, researchers, archivists, and the long tail of creators who benefit from an open, discoverable web.
None of this means we should ignore legitimate concerns about AI training or creator compensation. But we should address those concerns through mechanisms that preserve internet openness rather than destroy it. That might mean new business models, better attribution systems, or novel approaches to creator compensation. What it shouldn’t mean is abandoning the fundamental architecture of the web.
And that would be unfortunate for all of us.
Filed Under: ai, bots, crawling, intermediaries, open internet, scraping
Companies: cloudflare, common crawl, internet archive, reddit


Comments on “We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else”
Ugh, it just gets worse and worse still, doesn’t it?
robots.txt files were created to limit what search engines are allowed to crawl (mostly to prevent irrelevant pages, like sign-in pages, from being listed in search engine results). When search engines became widely used, and Google became the main gateway to the internet, most robots.txt files became less strict about forbidden content, but also began forbidding other search engines (Bing has a long history of impersonating Google, or even crawling Google, to build its own index). If LLMs become a more widely used tool (and it seems things are starting to go that way), websites will let them in more openly, just as there are many more search engines today than when Google started.
The main difference is that LLMs don’t produce much traffic for websites (even when they can remember their sources), still hallucinate greatly from the crawled content, and the few major AI companies receive tremendous amounts of cash to crawl the web while the small websites being crawled may struggle to pay the bills.
What are people supposed to do though?
Traffic isn’t free, and in some cases AI training is equivalent to a DDoS.
And those doing AI scraping behave exactly like bad actors.
So should anyone hosting a website just bend over and pay the cost all the scraping incurs?
Maybe, just maybe, we should take the AI fuckers and whip them for refusing to even be civil instead of having everyone else bend over to them.
Don’t blame the victims, blame the people hammering smaller sites on the scale of a DDoS attack while refusing to even contemplate obeying the rules that built the open internet. It isn’t knee jerk when the attack has been ongoing for years now and the costs of running anything is going up while income from advertisements and views from actual humans is decreasing.
The tech giants have created the wall that is destroying the internet. AI results have been erected as a wall between people googling or bing-ing something and external sites, to stop users from going elsewhere and to obliterate the amount of money the giants have to pay the people who run their adverts. What choice have the tech giants left people but to block them by any means necessary, given all that they take while trying to give nothing in return?
Re:
The thing is, everyone is a victim, not just the sites getting hammered. The value of the internet is its openness. You can always route around one obstacle to your own education and freedom by finding another path, but if it all starts to get walled off, the utility and thus the freedom goes down. It’s important not to kill the internet to “save” it.
Aegrescit medendo.
Re:
Lots of abusers claim to be “victims”. That doesn’t make it okay.
I’m a victim of this “robot”-blocking. I went to check for a new OpenWRT release, and was instead told to prove I was human. Their “downloads” sub-domain is still accessible, at least. I can’t read Linux kernel mailing list messages on the web either, but they still provide “public-inbox” archives.
We’re basically told to take everyone’s word that robot traffic is harmful, although we have little actual data. I have my doubts. As has been pointed out here, there was a lot of work 25 years ago to support 10,000 clients at a time; but that work was finished a decade later, by which time a single server could do about 1-10 million. How many crawlers are there?
I think it’s more likely that people squandered decades of computing-power gains and just got away with it till more crawlers appeared. But if anyone’s to compete with Google and Bing, we need more crawlers. Another big project to improve efficiency would be better overall than wasting everyone’s electricity to prove their “humanity” (by running Javascript… which is apparently a thing humans do better than robots? Like that’s gonna stop people with billion-dollar data centers. They’re already getting around this shit.)
Re: Re:
Are you seriously calling people fighting to stop what is effectively a continual DDoS attack, being done by billion dollar companies hell bent on cutting them off from any potential traffic, abusers? Get a grip.
Re: Re: Re:
If cancer could speak, it would make similar complaints about chemotherapy being so bad for the body.
Re: Re:
Good lord. This is why I miss fuckinggoogleit.
https://www.theregister.com/2025/08/21/ai_crawler_traffic/
https://news.designrush.com/80-percent-of-web-traffic-is-bots-the-hidden-cost-of-ai-scraping
https://news.ycombinator.com/item?id=45105230
https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/
“As has been pointed out here, there was a lot of work 25 years ago to support 10,000 clients at a time; but that work was finished a decade later, by which time a single server could do about 1-10 million. ”
This is just a complete misunderstanding. Your car can do 100 mph, why don’t you drive that constantly everywhere? Yes. Things can be built to scale out and handle incredible load, BUT THAT IS NOT FREE. In whatever field you work in, how many customers, widgets, or whatever can one person handle at a time? It’s infinite right? No. A person can only do so much in x amount of time and then you need to hire more, then you need managers, then you need a new building, and then.. Computers work the same way, and people have budgets.
“I think it’s more likely that people squandered decades of computing-power gains and just got away with it till more crawlers appeared.”
Go run a website you make no money off of; let’s see how long you can afford it.
Re: Re: Re:
Your 404 Media link is paywalled, and the Ycombinator link is just another link to the Register story. The Fastly report mentions 39,000 requests per minute, which doesn’t seem like a huge number for a large site; only 650 per second. And that was one instance on one site, not a common thing.
Given how often regular people are accused of being bots—it’s happened to my grandparents who have no non-standard browser settings at all (and, having no idea what to do, couldn’t use that news site anymore)—I have to seriously question those numbers anyway. Design Rush references “proprietary data”. Fastly references “heuristics” and classifies 87% of bot traffic as “malicious”, which leaves at most 13% as “A.I.” (lumped in with search engines, the Internet Archive, and other such things). But it says 90% of “A.I.” traffic is Meta, Google, and OpenAI. If they can identify those crawlers, which I think do identify themselves, why not just block them, and leave the humans alone? That’d take “A.I.” traffic from 13% of all bot traffic to 1.3%.
This just seems like another unjustified “A.I. freakout”, though: focus on the 13% supporting the desired narrative, and nevermind the 87% that is account cracking, “ad fraud”, and so on. And there’s no comparison to historical numbers. Before Google took over, we had dozens of search engines crawling the web constantly; on weaker servers with slower connections and fewer “unlimited traffic” options. Is the current situation worse?
Re: Re: Re:2
… Please, shut up. You clearly know and understand NOTHING and have zero interest in learning.
“The Fastly report mentions 39,000 requests per minute”
Do you have a brain? Have you ever thought that not all tasks are equal?
Go run around the world, it’s just as easy as clapping your hands!
“If they can identify those crawlers, which I think do identify themselves, why not just block them, and leave the humans alone? ”
GO FUCKING READ! For fuck’s sake, the only thing worse than a trump loving pedo is the willfully stupid.
Re: Re:
I weep for you.
Re:
This. Although I’ll point out that it’s not equivalent to a DDoS attack; it is a DoS attack.
I’ve spent decades studying attacks and abuse, and this is one of the worst I’ve ever seen. It’s massive, it’s relentless, and the people behind it simply don’t care what they destroy. They’re using every creative/duplicitous trick in the book to avoid being held accountable and to evade countermeasures. That’s why there are all kinds of public and private anti-AI-crawler projects, large and small, that should never have been necessary — but ARE necessary, because these attacks are knocking sites off the air and costing a lot of people a lot of time, money, effort, lost sleep, and everything else.
Don’t blame the victims of the sociopathic greedy thugs at AI companies for this mess. They could have chosen to play nice in the sandbox, in the spirit of collaboration that we used to build the Internet. But no. They decided to be complete assholes, so they — and you, and everyone else — should not be surprised that we’ve decided not to put up with this nonsense.
Re: Re:
Ah, the words of someone highly expert, which (and whom) I can respect.
I remember the Old search engines.
I could find Tons of things that NOW you Cant. And I would have links if the last 2 computers had saved all the links to the net.
I can find many things Similar on YT, Now, but they are being BEATEN ON DAILY.
How many OLD Video sites are still around? Like Daily motion?(dont go there), And other that have been in cout so many times they CANT MAKE MONEY unless they HIDE.
I mentioned something(that really wont work) that there are 15 nations NOT dealing with Copyrights or IP. AND not acknowledging International Copyrights. And how it would be interesting to put up Porn sites and Other BANNED sites in those nations. LIKE Copies of Movies and music, that have Been MONITIZED to the point you CANT OWN ANYTHING, and Companies can Actually REMOVE, without Notice. Data they they placed and SOLD to you.(how many people LIKE ITUNES, NOT).
Re: Good news and bad news
Good news, I think this poster is in fact human. Bad news, the word salad of CAPS and (parentheses) makes things incomprehensible.
On the bright side, I can’t find you on Google.
Re: Re:
They are human and a long-time resident of the commentariat. Mostly agreeably incoherent.
Re: Re: Re:
I can’t tell whether they’re agreeable or not as I cannot read their word salad & capitalisation nightmare-fuel comments.
You’re calling it knee-jerk. But a lot of this stuff is in direct response to things those AI companies themselves are doing, like not respecting robots.txt. Take your own example of the difference between a crawler and a user agent: the problem is AI companies will literally lie about being a user agent, and crawl it anyway. (This is, not coincidentally, a big part of why AI companies are interested in making browsers. They get to do local queries that look like you.)
AI companies have been maximally shitty stewards of the open web. This is what happens when you have wide scale irresponsible use. This doesn’t even get to ethics on training, it’s “stop pounding my server into dust and costing me money I can’t afford”. Fundamentally, part of having an open web is that people need to use it responsibly, and AI companies aren’t. You can’t have an open system that’s dependent on hostile users. This isn’t new, if you had an abusive crawler in the past you’d get blocked, too.
Yeah, the problem is they are in fact doing that for a lot of people’s servers. And even when it doesn’t take it down, you’re paying for it. How much are you ok with paying, especially when the same crawler hits you repeatedly instead of caching? And what about smaller sites that can’t afford it? AI crawling is killing little sites, too.
If we’re going to tell people to nerd harder, a good place to start might be AI companies respecting the commons. You can’t complain about people leaving the pool after you piss in it.
Re: AT LEAST
Put up the PRivacy act..
A REAL protection of the data and personal info.
YOU CAN be tracked, by anyone, if they know how the system work..
At least when we had Hard wired Phones All they got as an Address. From the Phone book.
Re:
This.
It’s like that GOP lawsuit and investigation into Gmail spam blocking. If you do not want to be treated like a bad actor, then don’t act like one.
Re: Re:
Oho. This. This, indeed, and most emphatically.
Another issue with AI scraping that’s overlooked is that it completely fucking hammers the people’s compute resources.
I run a website and I have been dealing with the onslaught of AI scrapers for months, they reduce the hardware I’m running the site on to skin and bones. They intentionally try to make themselves blend in with other users. I’ve had to create incredibly complex rulesets and stuff, completely ban lots of crawlers, make users on certain ISPs and even entire countries fill out a captcha before accessing the website, etc, to be able to alleviate the load that these scrapers put on the website. The end result is that my users now have a way worse experience and many of them have to fill out captchas.
Another website I use quite frequently, which allows people to make freedom of information requests in New Zealand, is also suffering from AI scrapers. For months it’s been incredibly slow, and sometimes completely down; recently the admins put up a notice saying the website’s issues are being caused by the load AI scrapers are putting on their resources.
These AI scrapers are the devil and are completely killing the Internet by making it incredibly difficult for independent website operators to be able to run without being concerned about this nonsense.
Re:
“Another issue with AI scraping that’s overlooked is that it completely fucking hammers the people’s compute resources.”
Yes.
Even those of us with a lot of experience in performance tuning — at the network level, at the web server level, at the OS level, etc. — are finding that we can’t maintain previous service levels without large investments of time and money. Because of the thugs at AI companies.
And even when we do that — as I did last year when I replaced a server that was working perfectly fine pre-AI-crawlers with one that cost three times as much — the respite is only temporary. The crawlers effectively negated all that money and all the accompanying work in just a few months. Because of the thugs at AI companies.
Other people that I work with, collaborate with, or just correspond with have gone through similar things. A lot of them aren’t trying to expand their sites or improve them because they can’t — all of their resources are going into just trying to survive. And just as I discovered, they know that anything they add will just make their sites a bigger target. Because of the thugs at AI companies.
Libraries, museums, archives, and other resources that are perpetually starved for money, places where people work out of dedication to the concept, not because they’re going to get rich, are being decimated. Long-standing resources in science and technology are asking for help that they never needed before. Because of the thugs at AI companies.
We don’t need new business models, we don’t need to nerd harder, we don’t need any of that crap. What we need is for the thugs at these AI companies to behave like (at least) minimally decent human beings. There’s no tech fix if they don’t…
…although there may be a legal one. There are ongoing discussions of a massive class-action lawsuit. It’s unclear that’ll go anywhere, but I certainly would applaud it, provided it was for at least $1T — and that, by the way, is likely a serious underestimate of the aggregate cost of all this.
This is a firm, firm no, Mike.
I’m sorry, but letting AI abuse the everloving hell out of websites, after they have exploited the everloving hell out of our economy and copyright in the first place? No.
The dollar cost to the world from running AI is astronomical on top of it, so this is literally asking the victims why they won’t stop hitting themselves.
We need a significant tax on the power and space AI usage consumes in datacenters, and arguably a ban on AI outside of scientific use entirely.
https://searchengineland.com/google-web-thriving-dying-461653 is as clear as day on both the source and the outcome.
Re:
And when they inevitably implode, the VC boyz are gonna make everyone else feel it.
The so-called open web
What a horrendous take.
The promise of the open web was made in quite literally a different era. I had my first webpage in the 90s, and it was hosted on a University server where I was a student. Then there was geocities and other similar sites. For a long time, the open web was predicated on the fact that text pages have almost no bandwidth costs and Universities could act as repositories of useful knowledge, or that small ads could keep servers running. A few pennies here, a few pennies there, and eventually you were talking about real money.
Now it’s just Facebook and Google making the lion’s share of the ad money, and their revenue is declining. AI scrapers have all of the downsides of search engine browsing with none of the upsides: they’re bandwidth intensive, but they bring you no traffic at all. If you have any sort of ad support, you’re guaranteed no clickthrough. If you’re relying on traffic to drive engagement and possibly a subscription or a patreon payment, you’re super boned–the LLM isn’t going to bring someone to your site, they’re just gonna gobble up your content and you’ll never see any benefit.
AI is the death of the open web because the model doesn’t work anymore. “Oh no,” you’re saying, “if AI can’t scrape the web, the information is kept out of the hands of people!” You’ve forgotten that if nobody can afford to keep their websites running, the information ALSO disappears.
When there’s a shared pasture, farmers will come together and enforce usage limits, because otherwise we get the tragedy of the commons. But that tragedy was historically less frequent than you’d expect, because people want to get along with their neighbours, and everyone can be made to understand what the common good is.
But faceless corporations do not care about getting along with you. They will take your information, eat your bandwidth, drive you off the web and never look back. They haven’t even tried to come up with accommodations, they just consume endlessly and try to sell your own content back to you as masticated slop.
Re:
You really enjoy keying your own car to prevent vandalism too, right?
“Why does everyone want to fight against the Tragedy being inflicted upon this Commons??”
Jeez, with how much criticism you’re getting, Mike, I’m wondering if you’re alright?
Re:
Hmm? I see some criticism of the post, but nothing particularly harsh. It’s an opinion piece, and I fully expected some people to disagree. I still stand by the piece and think that it raises key points missed by many of the critics, but why should I not feel alright?
Re: Re:
I was just wondering.
Re: Re:
Maybe they’re someone who takes criticism extra-personally and therefore equates criticism to, I’unno, being stabbed in the gut or something.
Re: Re: Re:
Hahahahahaha…. No.
I’m not that guy, though I’ll admit I act like that.
I understand everyone has different opinions, but there were 26 comments (at the time) that seemed to criticize him.
Luckily, he expected the criticism.
Still, that was funny.
Re: Re: Re:
Or just the normal troll behavior of suggesting one may not, or should not, be all right: largely directed at the public, but if it pokes the presumed target, that’s a bonus.
Re: Re:
I’m wondering how you would feel if your website got DDoSed and you had to pay for the server load. I think there is a serious lack of understanding of how websites (or, more accurately these days, web applications) work, and how expensive, or just dangerous, it can be to let all traffic through all the time.
The overall point this piece makes is: bend over and take it. It ignores the history and context of the overall issue. We wouldn’t be here had AI companies decided to follow the rules, the law, or even just etiquette. AI companies purposely hid who they were and what they were doing for years, just so people couldn’t refuse them. Then they decided to simply ignore the systems that had been set up previously. Then they went so far as to straight up pirate stuff and commit theft, because working with people above board was just too hard and slow.
So no, there are no good points here, except maybe that this is the same type of absolute horse shit that those who defend abusers say. You could replace this article with one blaming a wife for hurting her husband because she left him over his abuse.
Re: Re: Re:
I mean, I run THIS website, and yes, on occasion we have gotten DDoSed, whether on purpose or not, and we figure out ways to block it and move on.
It suggests no such thing. I am curious what you read, because it was not this piece.
I repeat. It suggests no such thing.
I mean, most of that is either untrue or misleading, which makes me question why you feel the need to opine on something you appear to know little about.
You do not seem particularly connected to reality.
Re: Re: Re:2
What parts of that are untrue or misleading?
Re: Re: Re:
That’s fucking harsh.
Geez.
Also:
We wouldn’t be here had AI companies decided to follow the rules, the law, or even just etiquette.
👆
The sad but true part here. 😔
Agreed
I’m broadly on board with your take, Mike. And more than a little saddened to look around at people in my generation, who also grew up with the open web, seemingly deciding that everything we did then was wrong; that, in fact, we should not allow easy, free, open access to information; that we should abandon broad access to information and instead swing towards tighter copyright, age verification, and a wide variety of blocks and checks for various reasons, because maybe it will hurt an AI company somewhere (even though, based on everything I’ve seen, there’s an essentially zero percent chance it will).
Re:
There are at least two things you are (intentionally?) conflating there. One is the stupid human reactions to whatever, which are as stupid as the AI companies; no good side of the supposed two there. Then there are those who are literally just trying to defend their networks and compute; and sure, it’s a given that not all of them are doing so optimally, but who the hell could reasonably expect that?
Access controls are not optional
So if AI should go ahead and ignore robots.txt in response to a user’s query, should it also look for bugs in login prompts so it can get through those as well?
If you want to argue that site admins shouldn’t configure those files to block AI, I could maybe be convinced; I haven’t bothered to put any on my websites. But deliberately ignoring access control mechanisms configured by the webmaster, especially at scale and in what seems to be a matter of corporate policy, is absolutely not OK. Pretty sure that’s a federal crime here in the US under the CFAA. The fact that it’s merely a robots.txt file doesn’t seem to matter; the law only refers to “exceeding authorized access”, and from what I’ve seen it seems pretty clear that the AI scrapers are DEFINITELY doing that.
A big part of the reason there is so much collateral damage is because so many of the AI companies are deliberately ignoring the orders not to scrape sites. So the admins have to take a much heavier approach. If you just say “No AI”, they ignore you and sometimes even change user agents and such to pretend to be something else. The AI companies are intentionally ruining it for everyone else. But you’d rather blame literally anyone else in order to defend the AI companies’ industrial-scale criminal vandalism of the web…
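For what it’s worth, honoring robots.txt is trivial; a compliant crawler is a few lines of stdlib Python. This is a minimal sketch using `urllib.robotparser`, with a hypothetical bot name (`ExampleAIBot`) and a made-up robots.txt, not any real crawler’s policy:

```python
# Sketch: what a well-behaved crawler does before fetching anything.
# The user agents and robots.txt content below are illustrative assumptions.
from urllib import robotparser

# A robots.txt a site might serve to opt out of AI crawling
# while staying open to everything else.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The AI bot is told to stay out entirely; a polite crawler stops here.
print(rp.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False
# Ordinary user agents remain welcome.
print(rp.can_fetch("SomeBrowser", "https://example.com/articles/1"))   # True
```

Ignoring the file, or rotating user agents to dodge the `ExampleAIBot` rule, is a deliberate choice, not a technical limitation.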
Yes, the Cloudflare/Perplexity thing is as stupid as everyone else who “proves” their work is “stolen” by repeatedly prompting AI to look it up.
But you know what? An increasingly AI-mediated web is already not an open web. Never mind the external costs (since you never do anyway), with absolutely zero value as a result. So maybe the web has to die a closed death so we can reinvent it when we grow the fuck up a little.
We don’t even have to put garbage in anymore (though we certainly do!) to get the innovation of hot garbage out.
Copy-Paste the article into the AI prompt instead of the link
Sure, it’s a pain, but it works. I may write a script to do this, or better yet, search for one, because someone like me has probably already written one. Soon it will be a simple yet popular app that makes everyone’s life easier.
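The script this comment imagines is mostly just markup stripping. Here is a rough, stdlib-only sketch; the class and function names (`TextExtractor`, `article_text`) are my own invention, and a real version would fetch the page first (e.g. with `urllib.request`) and handle messier HTML:

```python
# Sketch: strip a page's HTML down to pasteable plain text,
# skipping <script>/<style> contents. Names here are illustrative.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def article_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

print(article_text("<html><body><h1>Title</h1><p>Body text.</p>"
                   "<script>ignored()</script></body></html>"))
# prints "Title" then "Body text."; the script contents are dropped
```

From there, pasting the result into a prompt (or piping it to a clipboard tool) is the easy part.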
open web blackout day
Hey, remember the SOPA blackout day, January 18, 2012?
https://en.m.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
Maybe we should organize a day like that: sites suffering from AI-crawler DDoS all go black together.
Raise awareness and make it hurt!
It's blowing my mind that there's an argument here
Basically every argument in this thread: AI is basically a DDoS attack, loaded with actively engaged, high-income customers I don’t want! I refuse to draw a distinction between training spiders and lucrative users who DARE use AI to find products and services. I only want customers to come in the old-fashioned way. The nerve of these people. They’re the ones destroying the internet. Talk to them about it!
Re:
So I can go ahead and break into your home at night — or even burn the whole house to the ground — as long as I’m considering making a large enough purchase from you at some later date? THAT is your counter-argument??
If the admins don’t want the AIs browsing their site, they have every right to put up those NO TRESPASSING signs. If the billion dollar AI companies choose to ignore those signs, that’s a federal crime here in the US. It doesn’t actually matter how much money these “users” (who are not actually visiting or using the site in question) might hypothetically be willing to pay at some undetermined future date. While folks like Trump certainly like to think that throwing around enough cash means you’re fully above the law, everyone who isn’t a billionaire generally agrees that it really shouldn’t work that way. (And everyone who is a lawyer agrees that the law doesn’t work that way…it’s just that corrupt cops and prosecutors often refuse to enforce it properly. Which is not something that most people defend.)
Re:
They aren’t customers. They come in uninvited and take whatever they find for free, so they can erect a wall between actual humans and the sites they are battering with their continual scraping attempts, depriving those sites of actual human visitors and advertising revenue. Google is aware that what it is doing is strangling the life out of the open internet, but it’s doing it anyway… But sure, it’s the victims of this that are the bad guys for pushing back.
Just like I do. Gonna block me too?
interesting
I am on the sidelines of all this as a web designer and host of a few small sites. I am not seriously affected, as my income is not directly connected to traffic, but I run several small AWS Lightsail instances and they have been wiped off the face of the earth: no humans, just crawlers of some sort. I really didn’t understand it, but ChatGPT has helped me at least get my burst level below maximum. The bots effectively made my sites unusable for over a year, and even now I am only available via a bunch of Cloudflare rules I don’t understand. I will expand if this post works, to save writing lots and then finding I need an account.
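The kind of throttling rule this host is leaning on (Cloudflare rate limits, burst caps) is usually some variant of a token bucket. A minimal sketch, with arbitrary numbers chosen purely for illustration:

```python
# Sketch: a token-bucket rate limiter, the idea behind "burst" limits.
# rate/burst values below are arbitrary assumptions, not anyone's config.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens refilled per second (sustained rate)
        self.capacity = burst     # maximum burst size
        self.tokens = burst       # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over budget: reject (or delay) this request

bucket = TokenBucket(rate=5, burst=10)  # 5 req/s sustained, bursts of 10
results = [bucket.allow() for _ in range(12)]
# In a tight loop, roughly the first `burst` requests pass; the rest are throttled.
print(results.count(True))
```

A human browsing never notices a limit like this; a scraper hammering every URL hits the `False` branch almost immediately, which is why it tends to be the first mitigation people reach for.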
more interesting
I am only commenting because this feels like the end, really, and people might be interested in my position. It is, however, slightly grim or controversial. I run several sites, one on war and another few on true crime, none of which are PC-sanitised, but none of which are gross. I operate ethically, but I don’t think the powers that be like me. Straight off: I run black kalendar dot nl. I doubt you’ll find it in Google, as it can’t index it for some reason. This is not a major issue, as I was never about traffic, and it means my site is still there and I can work on it. It has about 40k cases that I wrote over 15 years or so. It’s very unusual, and I imagine a target for AI crawlers. I have responded by setting all sorts of rules which I don’t understand, resulting in Google not being able to index it. I also chopped off all content beyond the second tag; some stories were 100k words or longer.
My situation with this is that AI bots were, in theory, stealing all my content. However, the irony is that many can’t reproduce it, because they sanitise crime information. I have tried questioning ChatGPT about cases and it knows nothing about them. So I get hammered and none of the information is used directly. My only fear is that someone could quickly reproduce my 15 years of work. Much of the information comes from secure archive data that you have to access physically, one case at a time, so only I have it. I have converted some of it into books, but now, so can other people. So what you have here is a small party that can barely afford a simple 4GB RAM LAMP stack on AWS being effectively destroyed by bots that don’t care. I sort of don’t care, as it’s more a ‘labour of love’ (although it’s certainly not ‘love’) and not attached to income directly, although I do sell books. Part of my point is that I am the sole supplier of much of this information, which is very unique, and it’s clear that suddenly anyone can use it via a prompt.
I also run a war site, moderwar dot games, and the situation is exactly the same: red-lined on the CPU burst for over a year. There, people subscribe to play. I have had no complaints, but I don’t really have any users, maybe 1 or 2 at a time. However, the bots don’t care; it’s a living DDoS attack. Unlike the crime site, in which I fear my proprietary information was being used, there is little value in the information here; they are all simply games. I have used ChatGPT to recover CPU cost, including fixing my bad coding, and am now back afloat.
A third example was a customer’s online shop: his 40k products were a honeypot for bots. I simply didn’t understand what was happening, beyond the fact that his site was constantly going offline. It truly soured relations permanently, and now he is out of business. So from my perspective AI has destroyed the web. It even makes hit counters hard to work with, as it’s all bots.
If anything gets any serious size, the cost of the traffic becomes an issue. When dealing with customers now, you need to factor in the impact of AI, and they think you’re an amateur if you have doubts or are vague. They don’t want VPS solutions, and so small sites that have a lot of content tend to get hammered, so much so that people don’t bother. I truly believed in the open web; that’s why I did my crime site. It was old school, with no newsletters or ads or anything; it was totally free for innocent humans to read. But the bots destroy bandwidth, and the possibility that people could replicate the whole thing at a click is such that I have cut virtually all the content down to the first two tags. This is a big blow. Suffice to say, I am not too upset. The upside is that, while continuing, I am pivoting part of the sites’ purpose to feeding the AI. ChatGPT tells me that the future is robot websites with schema to feed the AI systems, and I am working on structured data based on statistics, playing down the cases as available data. The end of it is that things are changing. I never thought I would restrict my content as I have, but after 15 years I have no choice.
A surprisingly bad take
I would not have bet on Mike Masnick railing against the basic protections available via the ancient robots.txt standard.
Possible Solution
A solution I’ve toyed with is to require a particular crypto wallet address to access your content. I did not say crypto payment; I said a crypto wallet.
Firstly, it would initially block all bots, crawlers, and AI. Secondly, if you know all your visitors are crypto-savvy, it helps with building future content/products/offerings.
Eventually, let’s say AI starts having its own wallets to use. Fine. Now you add an agreement that says if an LLM is logging in to use your content, it must pay you xyz amount of xyz crypto via your receiving wallet address. They have it, you can receive it, and now AI has a way to actually pay people for what they steal.
If they steal it without paying you, theoretically you now have a legal case. But long term it’s in AI’s best interest to pay people some crypto for their content; otherwise, the content will go away. People, for the most part, will not keep producing content for no reward.