Stories filed under: "open web"

The Free And Open Web Is Under Attack At The IETF

from the the-open-web-includes-the-ability-to-scrape dept

Thu, Jun 25th 2026 03:06pm - Tori Noble

The ability to access publicly available information using automated tools is a central value and benefit of a free and open internet. Automated access—often called crawling or scraping—powers important, useful tools for locating, preserving, and analyzing online information. For example, crawling and scraping helps journalists, researchers, and watchdog organizations report the news, find security flaws, and investigate discrimination. Crawling the web allows non-profits like the Internet Archive to preserve historical copies of websites. Tools for automated comparison shopping allow consumers to find the best deals on items they want to buy. And so on.

Yet open internet access is increasingly under threat from publishers and Big Tech companies alike. Fearing lost advertising and licensing revenues, website operators increasingly claim that they need to lock down their sites from bots that crawl public web content to train or operate AI models. Some companies are even trying to embed their business models into internet standards by changing Internet Engineering Task Force (IETF) technical standards that shape much of the internet.

Many of their economic anxieties are understandable. AI bots can strain websites’ infrastructure, in some cases degrading site performance or taking sites offline altogether. Upgrading systems costs money that some sites may not have. And AI is likely to disrupt the business models many publishers adopted in response to the rise of the internet, if users rely on AI overviews instead of visiting source websites.

However reasonable these fears may be, the answer is not to change the IETF standards from neutral protocols that encourage openness to restrictive requirements designed to monetize internet access.

The worst of these proposed standards would give websites far greater ability to automatically block legitimate, lawful scraping and crawling. For example, the AI Preferences working group is working on proposals to give publishers a way to express “preference signals” against crawling web data for AI-related purposes, including to train models, generate outputs, and help users search the web. These preference signals would be expressed through robots.txt and could potentially become legally binding in some jurisdictions.
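For readers unfamiliar with the mechanics, here is a minimal sketch of how a well-behaved crawler consults robots.txt today, using Python's standard `urllib.robotparser`. It shows only the existing allow/disallow convention, not the preference-signal vocabulary the working group is still debating. The bot names and the robots.txt contents are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is shut out entirely,
# everyone else is allowed. The bot name is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("ExampleAIBot", "ResearchCrawler"):
        print(bot, can_fetch(bot, "https://example.com/article"))
```

Note that nothing in this check is enforced by the protocol itself: the crawler decides whether to honor the answer, which is why a shift toward legally binding signals would be a significant change.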

Another working group, called Web Bot Auth, is pursuing efforts to protect sites from overly-aggressive bots that strain website resources—a positive goal that could meaningfully improve the internet in the AI era. But Web Bot Auth is simultaneously pursuing a much more dangerous path as well: standards changes that would enable sites to cryptographically identify bots so that they can more easily block anyone they wish—not just “bad” actors, but competitors, dissidents, or anyone who hasn’t paid for the right to access sites using automated tools. If sites restrict crawling to a preapproved list of cryptographically authenticated bots, they could require licensing payments from those wishing to crawl their sites. This would close off the open web to researchers, archivists, and startups without the ability to pay for automated access.
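To make the allowlist concern concrete, here is a rough sketch of what gating access on a preapproved list of authenticated bots could look like on the server side. This is not the Web Bot Auth design: it uses a shared-secret HMAC from Python's standard library purely as a stand-in for whatever signature scheme a real deployment would use, and every name in it (`APPROVED_BOTS`, `sign_request`, `is_allowed`, the bot identifiers) is hypothetical.

```python
import hashlib
import hmac

# Hypothetical allowlist mapping a bot identifier to a shared secret.
# Assumption: a real scheme would verify public-key signatures instead.
APPROVED_BOTS = {
    "licensed-crawler": b"secret-issued-after-payment",
}


def sign_request(bot_id: str, secret: bytes, path: str) -> str:
    """Produce a hex signature over the bot id and the requested path."""
    message = f"{bot_id}:{path}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_allowed(bot_id: str, path: str, signature: str) -> bool:
    """Admit a request only if the bot is on the list and the signature checks out."""
    secret = APPROVED_BOTS.get(bot_id)
    if secret is None:
        # Unlisted bots are refused no matter how politely they crawl.
        return False
    expected = sign_request(bot_id, secret, path)
    return hmac.compare_digest(expected, signature)
```

The point of the sketch is the `secret is None` branch: an archivist or researcher whose bot was never added to the list is turned away before any question of load or behavior is even considered.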

Websites may have legitimate reasons to worry about AI’s impacts on their traffic and advertising revenue, but those reasons must be weighed against the benefits of the open web. These proposals would effectively give website operators veto power over a wide range of important uses—from the investigations and archival works described above to accessibility tools for people with disabilities, to research efforts aimed at holding governments accountable.

That is why we are fighting back against these threats to open access. EFF and our allies in the open internet community have successfully resisted some of the most dangerous IETF proposals thus far—and won’t stop working to protect the open web from efforts to manipulate internet standards to undermine the right to freely access the internet in any legal way, including with automated tools.

Republished from the EFF’s Deeplinks blog.

Filed Under: ai, open access, open web, scraping
Companies: ietf

21 Comments

Why Google’s New AI-Saturated Search Page Will Be A Disaster

(Mis)Uses of Technology

from the the-end-of-ten-blue-links dept

Thu, Jun 11th 2026 11:05am - Glyn Moody

Google didn’t invent full-text search of the Internet – that honor belongs to early pioneers such as WebCrawler, Lycos and AltaVista. But for the last 25 years or so, Google has been synonymous with online searching, providing the quickest and most effective way to find things online (although its results may be getting worse). More recently, it has been adding more features based on generative AI to its search engine, first with its AI Overviews in 2024, and then a year later with its AI Mode in Search. Now it has announced the latest stage in that evolution with what it calls “A new era for AI Search”:

It’s more intuitive than ever, dynamically expanding to give you space to describe exactly what you need. Designed to anticipate your intent, it also helps you formulate your question with AI-powered suggestions that go beyond autocomplete. And you can search across modalities, using text, images, files, videos or Chrome tabs as inputs.

This new incarnation effectively turns search into a chatbot:

You can easily ask a follow-up question right from an AI Overview, and flow into a conversational back and forth with AI Mode. Your context stays with you, and as you explore more deeply, the links and supporting articles get even more relevant. This seamless experience is live today across desktop and mobile, worldwide.

As the screenshot of the new interface above shows, the traditional search result links that are currently placed under the AI Overview have now been confined to a small panel on the right-hand side of the screen, which shows a cut-down version of today’s list. Users are encouraged to ask follow-up questions from the AI search chatbot, rather than exploring the links themselves.

What this is likely to mean in practice is that even fewer people will follow links to sites, something that was already happening last year; instead, they will engage with Google’s chatbot to gather information indirectly. This is terrible news for access to knowledge because it frames the Google AI search engine as the fount of all knowledge – one that will do all the hard work of finding information and combining it into an easily digested answer that can be interrogated further. It can do that because it has already ingested billions of Web pages and other information sources as part of the Large Language Model (LLM) training process. But search engine users will no longer know what some of those sources are unless they painstakingly click on the links in the new panel.

Most people will not bother, because the AI-generated results will be good enough – or at least will appear to be good enough. Unless visitors to the site take the trouble to follow the links to the sources they won’t really know how reliable those results are. For example, it is possible that the sources are wrong, or misleading; moreover, Google’s LLM may itself introduce new errors and distortions. There is also the question of how Google will insert ads into this AI-generated information, and to what extent advertisers will be able to buy preferential treatment in results.

This new mediated approach is clearly terrible news for Wikipedia – an issue already discussed on Walled Culture earlier this year – and for creators. Google will use the information found in their works, but will not actively encourage people to visit the originals. For many people, summaries will be good enough, and they will never discover the greater riches of the sites and creations that Google’s LLM is based on. Worse still, the original creators such as Wikipedia may not even be mentioned in answers that involve aggregating information from a large number of sources.

Similarly, the new Google search is the publishing industry’s worst nightmare. Not only is Google drawing on material they have published, but it is pushing links to those sources into the background. It seems inevitable that the Web traffic to publishers will fall yet further, making already struggling business models based on advertising even more precarious. That will have knock-on consequences for the funding of many sites – particularly newspapers and magazines – and for the commissioning of work from journalists and other creative professionals. Users won’t even need to visit Google Search much in order to keep up-to-date with topics of interest thanks to Google Search’s new agentic capabilities that will do the work for them in advance:

With information agents, you can stay updated on whatever matters most to you. Your agent will intelligently look across everything on the web, like blogs, news sites and social posts, plus our freshest data, such as real-time info on finance, shopping and sports, to monitor for changes related to your specific question.

In this case, not only will people not visit sites, but the latter will be constantly bombarded by various AI bots seeking information on behalf of users – increasing site running costs, and making sites less usable by humans. Another key announcement from Google will lead to a further flood of agentic activities that will pose new challenges to businesses:

We’re also expanding agentic booking capabilities in Search to a wide range of new tasks, including local experiences and services. Just share your specific criteria — like finding a private karaoke room for six on a Friday night that serves food late — and Search brings together the latest pricing and availability with direct links to finish booking through the provider of your choice. And for select categories like home repair, beauty or pet care, you can ask Google to call businesses on your behalf.

What emerges from Google’s latest announcements is less of a search engine, and more of an immersive virtual environment that is designed to keep people engaging with Google’s services, asking them for information, advice and even delegating actions to them. There is no doubt that many users will find these new features attractive, not least because they can use “conversational voice features” in Gmail, Docs and elsewhere. These are the digital assistants that have been promised for many years, able to understand spoken commands, provide information verbally, and carry out complex operations on behalf of users without the need for any complex training. For many people, that will be a boon, and they will doubtless migrate from the traditional search page, which will still be the default – at least for now – to the latest AI-infused version.

But these impressive technical features come at a high price, even leaving aside issues such as the environmental impact of the huge server farms they require. With the latest incarnation of its search engine, Google is making the World Wide Web as we have known it for over 30 years invisible, and therefore increasingly irrelevant to most people, who will be happy to let Google become their universal user interface to everything. And yet Google still depends on the Internet to supply all the information it is analyzing and repackaging. It risks killing the very thing that sustains it.

There’s another, more subtle issue. The new Google search features make finding information and carrying out actions very easy in many ways. Leaving aside the problem that this will require people to trust what is in effect a huge black box, where the internal workings cannot be examined, with all the loss of control this implies, there is another danger. People who use Google’s powerful new AI search services to offload many of their day-to-day actions may gradually lose the ability to understand the world and to act within it without that constant help. Such a dependence may be great for Google and its advertisers, but it surely cannot be a good thing for the future of society.

Follow me @glynmoody on Mastodon and on Bluesky. Originally published to WalledCulture.

Filed Under: ai, links, open web, search
Companies: google

24 Comments

AI Might Be Our Best Shot At Taking Back The Open Web

Predictions

from the hear-me-out dept

Wed, Mar 25th 2026 09:28am - Mike Masnick

I remember, pretty clearly, my excitement over the early World Wide Web. I had been on the internet for a year or two at that point, mostly using IRC, Usenet, and Gopher (along with email, naturally). Some friends I had met on Usenet were students at the University of Illinois at Urbana-Champaign, and told me to download NCSA Mosaic (this would have been early 1994). And suddenly the possibility of the internet as a visual medium became clear. I rushed down to the university bookstore and picked up a giant 400ish page book on building websites with HTML (I only finally got rid of that book a few years ago). I don’t think I ever read beyond the first chapter. But what I did do was learn how to right click on webpages and “view source.”

And from that, magic came.

I had played around with trying to build websites, and I remember another friend telling me about GeoCities (I can’t quite recall if this was before or after they had changed their name from their original “Beverly Hills Internet”) handing out web sites for free. You just had to create the HTML pages and upload them via FTP.

And so I started designing really crappy websites. I don’t remember what the early ones had, but like all early websites they probably used the blink tag and had under construction images and eventually a “web counter.”

But the thing I do remember was the first time I came across Derek Powazek’s Fray online magazine. It was the first time I had seen a website look beautiful. This was without CSS and without Javascript. I still remember quite clearly an “issue” of Fray that used frames to create some kind of “doors” you could slide open to reveal an article inside.

Right click. View source. Copy. Mess around. A week later I had my own (very different) version of the sliding doors on my GeoCities site, but using the same HTML bones as Derek’s brilliant work.

You could just build stuff. You could look at what others were doing and play around with it. Copy the source, make adjustments, try things, and have something new. There were, certainly, limitations of the technology, but it was incredibly easy for anyone to pick up. Yes, you had to “learn” HTML, but you could pick up enough basics in an afternoon to build a decent looking website.

But then two things happened, and it’s worth separating them because they’re different problems with different causes.

First, the technical barrier went up. CSS and Javascript opened up incredible possibilities to make websites beautiful and interactive, but they also meant it was a lot more difficult to just view source, copy, and mess around. The gap between “basic functional website” and “actually looks good” widened into a chasm that required real expertise to cross. Plenty of dedicated people learned these skills, but the casual tinkerer — the person who’d spend an afternoon copying Derek’s frames to make sliding doors — increasingly couldn’t keep up.

But the technical complexity alone didn’t kill amateur web building. The centralization did. While there was an interim period where people set up their own blogs, it quickly moved to walled “social media gardens” where some giant tech company decided what your page looked like. Why bother learning CSS when you could just dump text in a Facebook box and reach more people? The incentive to build your own thing evaporated, replaced by the convenience of posting to someone else’s platform under someone else’s (hopefully benign) rules.

These two problems reinforced each other. The harder it got to build your own thing, the more attractive the walled gardens became. The more people moved to walled gardens, the less reason there was to learn to build.

The rise of agentic AI tools is opening up an opportunity to bring us back to that original world of wonder where you could just build what you wanted, even without a CS degree. And here I need to be specific about what I mean by “agentic AI” — because too many people are overly focused on the chatbots that answer questions or generate text or images for you. I’m talking about AI systems that can actually do things: write code, execute it, debug it, iterate on it based on your feedback. Tools like Claude Code, Cursor, Codex, Antigravity, or similar coding agents that can take a description of what you want and actually build it.

For all those years that tech bros would shout “learn to code” at journalists, the reality now is that being able to write well and accurately describe things is a superpower that is even better than code. You can tell a coding agent what to do… and for the most part it will do it.

Let me give you the example that still kind of blows my mind. A few weeks ago, in the course of a Saturday — most of which I actually spent building a fence in my yard — I had a coding agent build an entire video conferencing platform. It built a completely functional platform with specific features I’d wanted for years but couldn’t find in existing tools. I’ve now used it for actual staff meetings. The fence took longer to build than the software.

All it took was describing what I wanted to an agent that could code it for me. And it addresses both problems I described earlier: it lowers the technical barrier back down to “can you describe what you want clearly?” while also enabling you to build your own thing rather than accepting whatever some platform offers you.

Over the last few months I’ve been finding I need to retrain my brain a bit about what we accept and learn to deal with vs. what we can fix ourselves. In the past I’ve talked about the learned helplessness many people feel about the tech that we use. We know that it’s vaguely working against us, and we all have to figure out what trade-offs we’re willing to accept to accomplish whatever goals we have.

But what if we could just fix things rather than accepting the tradeoffs?

I’ve talked in the past about how I’ve used an AI-assisted writing tool called Lex over the past few years, which doesn’t write for me, but is a very useful editorial assistant. Over the last few months, though, I decided to see if I could effectively rebuild that tool myself, fully controlled by me, without having to rely on a company that might change or enshittify the app. I actually built it directly into the other big AI experiment I’ve spoken about: my task management tool, which I’ve also moved away from a third party hosting service onto a local machine. Indeed, I’m writing this article right now in this tool (I first created a task to write about it, and then by clicking a checkbox that it was a “writing project” it automatically opens up a blank page for me to write in, and when I’m done, I’ll click a button and it will do a first pass editorial review).

But the amazing thing to me is that I keep remembering I can fix anything I come across that doesn’t work the way I want it to. With any other software I have to adjust. With this software, I just say “oh hey, let’s change this.” I find that a few times a week I’ll make a small tweak here or there that just makes the software even better. In the past, I would just note a slight annoyance and figure out how to just deal with software not working the way I wanted. But now, my mind is open to the fact that I can just make it better. Myself.

An example: literally last night, I realized that the page in the task tool that lists all the writing projects I’m working on was getting cluttered by older completed projects that were listed as still being in “drafting” mode. With other tools (including the old writing tool I was using), I would just learn to mentally compartmentalize the fact that the list of articles was a mess and train myself to ignore the older articles and the digital clutter. But here, I could just lay out the issue to my coding agent, and after some back and forth, we came up with a system whereby once a task on the task management side was checked off as “completed” the corresponding writing project would similarly get marked as completed and then would be hidden away in a minimized list.
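As a rough illustration of how small a fix like that can be, here is a sketch of the rule in Python. It is not the code from my tool; the `WritingProject` class, its field names, and the status strings are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class WritingProject:
    """Hypothetical record linking a writing project to its task."""
    title: str
    task_id: int
    status: str = "drafting"


def sync_completed(projects, completed_task_ids):
    """Mark a project completed once its linked task has been checked off."""
    for project in projects:
        if project.task_id in completed_task_ids:
            project.status = "completed"


def visible_projects(projects):
    """Return only the projects that belong in the main, uncluttered list."""
    return [p for p in projects if p.status != "completed"]


if __name__ == "__main__":
    projects = [
        WritingProject("Old post", task_id=1),
        WritingProject("Current draft", task_id=2),
    ]
    sync_completed(projects, completed_task_ids={1})
    print([p.title for p in visible_projects(projects)])
```

Completed projects are filtered out of the main view rather than deleted, so they can still be shown in a separate minimized list.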

I keep coming across little things like this that, in the past, I would have been mildly annoyed by, but needed to live with. And it’s taking some effort to remind myself “wait, I don’t have to live with this, I can fix it.” Rather than training my brain to accept a product that doesn’t do what I want, I can just tell it to work better. And it does.

And, the more I do that, the more I start to open up my mind to possibilities that were impossible before. “Huh, wouldn’t it be nice if this tool also had this other feature? Let’s try it!” I find that the more I do this, the bigger my vision gets of what I can do because the large segment of things that were fundamentally impossible before are now open to me, just by describing what I want.

It really does give me that same underlying feeling that I felt when I was first playing around with HTML and being able to “just make things.” Except, now, it’s way more powerful. Rather than copying Derek’s use of HTML frames to create “sliding doors” on a webpage, I can create basically anything I dream up.

Then, when combined with open social protocols, you can build in social features or identity to any service as well — without having to worry about getting other users. They’re already there. For example, my task management tool sends me a “morning briefing” every day that, among other things, scans through Bluesky to see if there’s anything that might need my attention.

Now, there are legitimate criticisms of “vibe coded” tools. Critics point out that AI-generated code can be buggy, insecure, hard to maintain, and that users who can’t read the code can’t verify what it’s actually doing. These are real concerns — for certain contexts.

The thing is, most of these criticisms apply to tools being built as businesses to serve customers at scale. If you’re shipping code to millions of users who are depending on it, you absolutely need security audits, proper testing, maintainable architecture. But that’s not what I’m talking about. I’m talking about building totally customized, personal tools for yourself—tools where you’re the only user, where the stakes are “my task list doesn’t sync properly” rather than “customer data got leaked.”

There’s also a more subtle concern worth addressing: is this actually democratizing, or does it just shift which skills you need? After all, you still need to accurately describe what you want, debug when things go wrong, and understand what’s even possible. That’s different from learning HTML, but it’s still a skill. I think the honest answer is that the kind of skill needed has shifted. “Learn to code” becomes “learn to think clearly and describe things precisely” — which happens to be a superpower that writers, editors, and domain experts already have. The barrier has moved to territory that many more people already inhabit.

It’s also an area where you can easily start small, learn, and grow. I started by building a few smaller apps with simpler features, but the more I do, the more I realize what’s possible.

Also, I’d note that this is actually an area where the LLM chatbots are kind of useful. Before I kick off an actual project with a coding agent, I’ve found that talking it through with an LLM first helps sharpen my thinking on what to tell the agent. I don’t outsource my mind to the chatbot, and will often reject some of its suggestions, but in having the discussion before setting the agent to work, it often clarifies tradeoffs and makes me consider how to best phrase things when I do move over to the agent.

What gets missed in most conversations about AI and the open web: these two pieces need each other. Open social protocols without AI tools stay stuck in the domain of developers and the highly technical — which is exactly why adoption has been slow. And AI tools without open protocols just replicate the old problem: you’re building cool stuff, but you’re still trapped inside someone else’s walls.

Put them together, though, and something clicks. Open protocols like ATProto give AI agents bounded, consent-driven contexts to work in — your agent can scan your Bluesky feed because the protocol allows that, not because some company decided to grant API access that it could revoke tomorrow. And AI agents give regular people the ability to actually build on those protocols without needing an engineering team. My morning briefing tool scans Bluesky not because I wrote a bunch of API calls, but because I described what I wanted and a coding agent made it happen.

Each piece makes the other more powerful and safer.

Blaine Cook — who was Twitter’s original architect back when it was still a protocol-minded company — recently wrote a piece at New_ Public that gets at this from the infrastructure side:

My long-standing hope has been that we’re able to move past the extractive, monopolizing, and competitive phase of social networks, and into a new era of creativity, collaboration, and diversity. I believe we’re poised to see a Cambrian explosion of new ways to interact online, and there’s evidence to suggest that it’s already happening: just today, I saw three new apps to share what you’re reading and watching with friends, each with their own unique take on the subject!

In this light, LLMs may be a killer app for decentralized networks — and decentralized networks may be the missing constraint that makes LLM integrations safer, more legible, and more aligned with user interests. It’s a symbiosis, and I believe we need both pieces. Rather than trying to integrate LLMs with everything, I think that deliberately bounded, consent-driven integrations will produce better outcomes.

Cook’s framing of LLMs as a “killer app for decentralized networks” is exactly right — and it runs the other way too. Decentralized networks might be the killer app for making AI tools something other than another vector for corporate lock-in, or just another clone of an existing centralized service.

Now, I can already hear the objection, and it’s a fair one: am I really suggesting we escape dependence on giant tech platforms by… becoming dependent on giant AI companies? Companies that have scraped the entire web, that burn massive amounts of energy and water, that are built on the labor of underpaid content moderators, and that seem to want to consolidate power in ways that look an awful lot like the last generation of tech giants?

Yeah, I get it. If the pitch is “use OpenAI to free yourself from Meta,” that’s just switching landlords.

But that’s not actually where this is heading. The trajectory matters more than the current snapshot.

First, if you’re using frontier models through the API or a pro subscription, you have significantly more control than most people realize. Your data generally isn’t feeding back into training. You’re using the model as a tool, not handing over your content to a platform. That’s a meaningfully different relationship than the one you have with social media companies, where you’re feeding them data, and their business model is based on monetizing that data.

But much more importantly, you don’t have to use the frontier models at all. Open source AI is maturing fast — models like Qwen, Kimi, and Mistral can run entirely on certain hardware, no cloud required. They’re behind the frontier models, but only by a bit. Six months to a year, roughly. But for a lot of the “build your own tools” use cases I’m describing, they’re already good enough.

Musician and YouTuber Rick Beato recently showed how easy it was for him to install local models on his own machine, and why he thinks the largest AI companies will eventually be undercut by home AI usage.

I’ve been doing something similar with Ollama hosting a Qwen model locally. It’s slower and less sophisticated. But it works. And I already use different models for different tasks, defaulting to local when I can. As those models improve — and they are improving quickly — the frontier labs become less necessary, not more. If you’re a professional, perhaps you’ll still need them. But if you’re just building something for yourself, it’s less and less necessary.
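For anyone curious what "hosting a model locally" looks like from the code side, here is a minimal sketch that sends a prompt to Ollama's local HTTP API using only Python's standard library. It assumes Ollama is running on its default port and that a Qwen model has already been pulled; the model tag `qwen2.5` is an assumption, so substitute whatever tag you actually have installed.

```python
import json
import urllib.request

# Ollama's default local endpoint. Nothing here leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5") -> urllib.request.Request:
    """Build the POST request. The model tag is an assumed example."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "qwen2.5") -> str:
    """Send the prompt to the local Ollama server and return its reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(ask_local_model("Summarize why local models matter in one sentence."))
```

Setting `"stream": False` asks the server for a single JSON reply instead of a stream of partial chunks, which keeps the example short.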

This is what the “AI is just another Big Tech power grab” critics are missing: the technology is moving toward decentralization, not away from it. That’s unusual. Social media started decentralized and got captured. AI is starting captured and getting more open over time. The economic pressure from open source models is real, and it’s pushing in the right direction. But it’s important we keep things moving that way and not slow down the development of open source LLMs.

On the training data question — which is a legitimate concern whether or not you think training on copyrighted works is fair use — efforts like Common Corpus are building large-scale training sets from public domain and openly licensed materials. Anil Dash has been writing about what “good AI” looks like in practice — AI that’s transparent about its training data, that respects consent, that minimizes externalities rather than ignoring them. There are ways to do this right.

None of this is fully solved yet. But the direction is clear, and the tools to do it responsibly are improving faster than most critics acknowledge.

When you use AI as a tool (rather than letting it use you as the tool), it can give you a kind of superpower to get past the learned helplessness of relying on whatever choices some billionaire or random product manager made for you. You can get past having to mentally compensate for your tools not really working the way you think they should work. Instead, you can just have the internet and your tools work the way you want them to. It’s the most excited I’ve been about the open web since those early days of realizing I could right click, copy and then figure out how to build sliding doors out of frames.

The promise of the open web was colonized by internet giants. But the power of LLMs and agentic coding means we can start to take it back. We can build customized, personal software for ourselves that does what we want. We can connect with communities via open social protocols that allow us to control the relationship rather than a billionaire intermediary. This is what the Resonant Computing Manifesto was all about, and why I’ve argued ATproto is so key to that vision.

But the other part of realizing the manifesto is the LLM side. That made some people scoff early on, but hopefully this piece shows how these things work hand in hand. These agentic AI tools give the power back to you and me.

Thirty years ago, I right-clicked on Derek Powazek’s beautiful website, viewed the source, copied it, messed around with it, and built something new. I didn’t ask anyone’s permission. I didn’t agree to terms of service. I didn’t fit my ideas into someone else’s template. I just built the thing I wanted to build.

Then we gave that away. We traded it for convenience, for reach, for the path of least resistance — and we got walled gardens, manipulated feeds, and the quiet understanding that our tools would never quite work the way we wanted them to, because they weren’t really ours.

Today’s equivalent of right-clicking on Derek’s site is describing what you want to a coding agent, watching it build, telling it what’s wrong, and iterating until it works for you. Different mechanics, same magic. And this time, with open protocols and increasingly open models, we have a shot at keeping it.

Let’s not give it away again.

Filed Under: agentic ai, ai, coding, html, llms, open social, open web, protocols

97 Comments

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

from the the-open-web-is-closing dept

Wed, Dec 24th 2025 11:05am - Mike Masnick

Last week, Google filed suit against SerpApi, a scraping company that helps businesses pull data from Google search results. The lawsuit claims SerpApi violated DMCA Section 1201 by circumventing Google’s “technological protection measures” to access search results—and the copyrighted content within them—without permission.

There’s just one problem with this theory: Google built its entire business on scraping the web without asking permission first. And now it wants to use one of the most abused provisions in copyright law to stop others from doing something functionally similar to what made Google a tech giant in the first place.

The lawsuit comes on the heels of Reddit’s equally problematic anti-scraping suit from October—which we called an attack on the open internet. Reddit sued Perplexity and various scraping firms (including SerpApi), claiming they violated 1201 by circumventing… Google’s technological protections. Reddit was mad it had cut a multi-million dollar licensing deal with Google for access to Reddit content, and these firms were routing around both that deal and Google itself to provide similar results to users. The legal theory was bizarre: Reddit didn’t own the copyright on user posts, and the scrapers weren’t even touching Reddit directly—yet Reddit claimed standing to sue based on circumventing someone else’s TPMs.

So now, Google has filed its own, similar lawsuit, going after SerpApi directly, focused on how SerpApi gets around its attempts to block such scraping. Google released a blog post defending this lawsuit:

We filed a suit today against the scraping company SerpApi for circumventing security measures protecting others’ copyrighted content that appears in Google search results. We did this to ask a court to stop SerpApi’s bots and their malicious scraping, which violates the choices of websites and rightsholders about who should have access to their content. This lawsuit follows legal action that other websites have taken against SerpApi and similar scraping companies, and is part of our long track record of affirmative litigation to fight scammers and bad actors on the web.

Google follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content. Stealthy scrapers like SerpApi override those directives and give sites no choice at all. SerpApi uses shady back doors — like cloaking themselves, bombarding websites with massive networks of bots and giving their crawlers fake and constantly changing names — circumventing our security measures to take websites’ content wholesale. This unlawful activity has increased dramatically over the past year.

SerpApi deceptively takes content that Google licenses from others (like images that appear in Knowledge Panels, real-time data in Search features and much more), and then resells it for a fee. In doing so, it willfully disregards the rights and directives of websites and providers whose content appears in Search.

Look, SerpApi’s behavior is sketchy. Spoofing user agents, rotating IPs to look like legitimate users, solving CAPTCHAs programmatically—Google’s complaint paints a picture of a company actively working to evade detection. But the legal theory Google is deploying to stop them threatens something far bigger than one shady scraper.

Google’s entire business is built on scraping as much of the web as possible without first asking permission. The fact that they now want to invoke DMCA 1201—one of the most consistently abused provisions in copyright law—to stop others from scraping them exposes the underlying problem with these licensing-era arguments: they’re attempts to pull up the ladder after you’ve climbed it.

Just from a straight up perception standpoint, it looks bad.

To be clear: this isn’t about defending SerpApi. They appear to be bad actors who built a business on evading detection systems. The problem is that Google chose to go after them using a legal weapon with a long history of collateral damage. When you invoke Section 1201 against web scraping, you’re not just targeting one sketchy company—you’re potentially rewriting the rules for how the entire open web functions. The choice of weapon matters, especially when that weapon has been repeatedly abused to stifle legitimate competition and could now be turned against the very openness that made the modern internet possible.

For many years, we’ve discussed the many, many problems of DMCA Section 1201. It’s the “anti-circumvention” part of the law, which says that merely attempting to get around a “technological protection measure” (or even just telling someone else how to get around one) could be deemed to violate the law, even if the TPMs in question were wholly ineffective, and even if the intent in getting around the TPM had nothing to do with copyright infringement.

That has led to years of abusive practices by companies who would put silly, pointless “TPMs” in place just in order to be able to use the law to limit competition. There were lawsuits over printer ink cartridges and garage door openers, among other things.

Here, Google is saying that it put in place a TPM in January of 2025 called “SearchGuard” (which sounds like an advanced CAPTCHA of some sort) to prevent SerpApi from scraping its search results, but SerpApi figured out a way around it:

When SearchGuard launched in January 2025, it effectively blocked SerpApi from accessing Google’s Search results and the copyrighted content of Google’s partners. But SerpApi immediately began working on a means to circumvent Google’s technological protection measure. SerpApi quickly discovered means to do so and deployed them.

SerpApi’s answer to SearchGuard is to mask the hundreds of millions of automated queries it is sending to Google each day to make them appear as if they are coming from human users. SerpApi’s founder recently described the process as “creating fake browsers using a multitude of IP addresses that Google sees as normal users.”

SerpApi’s fakery takes many forms. For example, when SerpApi submits an automated query to Google and SearchGuard responds with a challenge, SerpApi may misrepresent the device, software, or location from which the query is sent in order to solve the challenge and obtain authorization to submit queries. Additionally or alternatively, SerpApi may solve SearchGuard’s challenge with a “legitimate” request and then syndicate the resulting authorization, that is, share it with unauthorized machines around the world, to enable their “fake browsers” to generate automated queries that appear to Google as authorized. It also uses automated means to bypass CAPTCHAs, another aspect of SearchGuard that tests users to ensure they are humans rather than machines.

Getting around these protections eats up Google’s resources, and sure, that must be annoying for Google. But the real motivation shows up when Google gets to the economics of the situation. Google has started cutting licensing deals with content partners—most notably the multi-million dollar Reddit deal—and now those partners are pissed that SerpApi lets others access similar data without paying anyone:

For Google, SerpApi’s automated scraping not only consumes substantial computing resources without payment, but also disrupts Google’s content partnerships. Google licenses content so that it can enhance the Search results it provides to users and thereby boost its competitive standing. SerpApi undermines Google’s substantial investment in those licenses, making the content available to other services that need not incur similar costs.

SerpApi’s scraping of Google Search results also impacts the rights holders who license content to Google. Without permission or compensation, SerpApi takes their content from Google and widely distributes it for use by third parties. That, in turn, threatens to disrupt Google’s relationship with the rights holders who look to Google to prevent the misappropriation of the content Google displays. At least one Google content partner, Reddit, has already sued SerpApi for its misconduct.

This is where the 1201 theory becomes genuinely dangerous. Google’s argument, if accepted, provides a roadmap for any website operator who wants to lock down their content: slap on a trivial TPM—a CAPTCHA, an IP check, anything—and suddenly you can invoke federal law against anyone who figures out how to get around it, even if their purpose has nothing to do with copyright infringement.

The implications spiral outward quickly. If Google succeeds here, what stops every major website from deciding they want licensing revenue from the largest scrapers? Cloudflare could put bot detection on the huge swath of the internet it serves and demand Google pay up. WordPress could do the same across its massive network. The open web—built on the assumption that published content is publicly accessible for indexing and analysis—becomes a patchwork of licensing requirements, each enforced through 1201 threats.

That doesn’t seem good for the prospects of a continued open web.

Google’s legal theory has another significant problem: the requirement that a TPM must “effectively control” access. Just last week, a court rejected Ziff Davis’s attempt to turn robots.txt into a 1201 violation when OpenAI allegedly ignored its crawling restrictions. The court’s reasoning is directly applicable here:

Robots.txt files instructing web crawlers to refrain from scraping certain content do not “effectively control” access to that content any more than a sign requesting that visitors “keep off the grass” effectively controls access to a lawn. On Ziff Davis’s own telling, robots.txt directives are merely requests and do not effectively control access to copyrighted works. A web crawler need not “appl[y] . . . information, or a process or a treatment,” in order to gain access to web content on pages that include robots.txt directives; it may access the content without taking any affirmative step other than impertinently disregarding the request embodied in the robots.txt files. The FAC therefore fails to allege that robots.txt files are a “technological measure that effectively controls access” to Ziff Davis’s copyrighted works, and the DMCA section 1201(a) claim fails for this reason.

Google will argue SearchGuard is different—it’s more than a polite request, it actively challenges and blocks scrapers. But if SerpApi can routinely bypass it by spoofing browsers and rotating IPs, does it really “effectively control” access? Or is it just a slightly more sophisticated “keep off the grass” sign that determined actors can ignore?

This question matters enormously because it determines whether the statute that was supposed to prevent piracy of CDs and DVDs now also governs every attempt to access publicly-available web pages through automated means.

For decades, we’ve operated under a system where robots.txt represented a voluntary, good-faith approach to web crawling. The major players respected these directives not because they had to, but because maintaining that norm benefited everyone. That system is breaking down, not because of SerpApi, but because of the rise of scrapers focused on LLM training, mixed with other companies wanting to find licensing deals to get a cut of the money flows. Reddit and Google negotiating licensing deals over open web content was a warning sign of all of this, and now it’s spilling out into the courts with questionable 1201 claims.

Both Reddit and Google frame this as protecting the open internet from bad actors. But pulling up the ladder after you’ve climbed it isn’t protection—it’s rent-seeking. Google built an empire on the assumption that publicly accessible web content could be freely scraped and indexed. Now it wants to rewrite the rules… using Hollywood’s favorite tool to block access to information.

The real problem isn’t that Google is fighting back against SerpApi’s evasive tactics. It’s that they chose to fight using a legal weapon that, if successful, fundamentally changes how we understand access to the open web. Section 1201 has already been wildly abused to stifle competition in everything from printer cartridges to garage door openers. Extending it to cover basic web scraping because SerpApi seems sketchy threatens the foundational assumption that published web content is accessible for indexing, research, and analysis.

Google has the resources to solve this problem through better engineering or by raising the actual cost of evasion high enough that SerpApi’s business model fails. Instead, they’ve opted for a legal shortcut that, if it works, will reshape the internet in ways that go far beyond one sketchy scraping company.

The internet is changing, and legitimate questions exist about how web scraping should function in an era of large language models and AI training. But those questions won’t be answered well by stretching copyright law to cover something it was never designed for, and empowering every website operator to demand licensing fees simply by putting up a CAPTCHA.

That’s not protecting the open web. That’s closing it.

Filed Under: 1201, anti-circumvention, circumvention, copyright, dmca 1201, licensing, open web, robots.txt, webcrawling
Companies: google, reddit, serpapi

Tackling The AI Bots That Threaten To Overwhelm The Open Web

(Mis)Uses of Technology

from the overrunning-the-commons dept

Mon, Jul 14th 2025 01:43pm - Glyn Moody

It is a measure of how fast the field of AI has developed in the three years since Walled Culture the book (free digital versions available) was published that the issue of using copyright material for training AI systems, briefly mentioned in the book, has become one of the hottest topics in the copyright world, as numerous posts on this blog attest.

The current situation sees the copyright industry pitted against the generative AI companies. The former wants to limit how copyright material can be used, while the latter want a free-for-all. But that crude characterization does not mean that the AI companies can be regarded as on the side of the angels when it comes to broadening access to online material. They may want unfettered access for themselves, but it is becoming increasingly clear that as more companies rush to harvest key online resources for AI training purposes, they risk hobbling access for everyone else, and even threaten the very nature of the open Web.

The problem is particularly acute for non-commercial sites offering access to material for free, because they tend to be run on a shoestring, and are thus unable to cope easily with the extra demand placed on their servers by AI companies downloading holdings en masse. Even huge sites like the Wikimedia Projects, which describes itself as “the largest collection of open knowledge in the world”, are struggling with the rise of AI bots:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

A valuable new report from the GLAM-E Lab explores how widespread this problem is in the world of GLAMs – galleries, libraries, archives, and museums. Here’s the main result:

Bots are widespread, although not universal. Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic.
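To put those survey counts in proportion, here is a minimal sketch that converts them to percentages. The figures are the ones in the quoted passage; the variable names and the one-decimal rounding are choices made for this illustration, not anything taken from the report.

```python
# Counts taken from the GLAM-E Lab passage quoted above.
respondents = 43        # total survey respondents
saw_increase = 39       # respondents reporting a recent traffic increase
blamed_ai_bots = 27     # of those, attributed it to AI training data bots
suspected_bots = 7      # of those, believed bots could be contributing


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


increase_share = share(saw_increase, respondents)
ai_share = share(blamed_ai_bots, saw_increase)
ai_or_suspected_share = share(blamed_ai_bots + suspected_bots, saw_increase)

print(f"Saw a traffic increase: {increase_share}% of respondents")
print(f"Attributed it to AI bots: {ai_share}% of those")
print(f"Attributed or suspected bots: {ai_or_suspected_share}% of those")
```

On these numbers, roughly nine in ten respondents saw an increase, and about seven in ten of those pointed to AI training bots.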

Although the sites that responded to the survey were generally keen for their holdings to be accessed, there comes a point where AI bots are degrading the service to human visitors. The question then becomes: what can be done about it?

There is already a tried and tested way to block bots, using robots.txt, a tool that “allows websites to signal to bots which parts of the site the bots should not visit. Its most widely adopted use is to indicate which parts of sites should not be indexed by search engines,” as the report explains. However, there is no mechanism for enforcing the robots.txt rules, which often leads to problems:

Respondents reported that robots.txt is being ignored by many (although not necessarily all) AI scraping bots. This was widely viewed as breaking the norms of the internet, and not playing fair online.

Reports of these types of bots ignoring robots.txt are widespread, even beyond respondents. So widespread, in fact, that there are currently a number of efforts to develop new or updated robots.txt-style protocols to specifically govern AI-related bot behavior online.
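The voluntary nature of robots.txt is easy to see in code. The sketch below uses Python's standard `urllib.robotparser` module to show what a well-behaved crawler does before fetching a page. The robots.txt content, the bot names (`ExampleAIBot`, `SomeSearchBot`) and the example.org URLs are all hypothetical, invented for this illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one named bot is asked to stay out entirely,
# everyone else is asked to avoid /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether this user agent is welcome at url.

    The crawler runs this check on itself. Nothing on the server side
    stops a bot that skips the check or ignores the answer.
    """
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleAIBot", "https://example.org/collection/item-1"))
print(may_fetch("SomeSearchBot", "https://example.org/collection/item-1"))
print(may_fetch("SomeSearchBot", "https://example.org/private/notes"))
```

The check happens entirely on the crawler's side, which is why a bot that never calls anything like `may_fetch` faces no technical obstacle at all.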

One solution is to use a firewall to block traffic according to certain rules. For example, to block by IP addresses, by geography, or by particular domains. Another is to offload the task of blocking to a third party. The most popular among survey respondents is Cloudflare:

One [respondent] noted that, although they can still see the bot traffic spikes in their Cloudflare dashboard, since implementing protections, none of those spikes had managed to negatively impact the system. Others appreciated the effectiveness of Cloudflare but worried that an environment of persistent bot traffic would mean they would have to rely on Cloudflare in perpetuity.

And that means paying Cloudflare in perpetuity, which for many non-profit sites is a challenge, as is simply increasing server capability or moving to a cloud-based system – other ways of coping with surges in demand. A radically different approach to tackling AI bots is to move collections behind a login. But for many in the GLAM world, there is a big problem with this kind of shift:

the larger objection to moving works behind a login screen was philosophical. Respondents expressed concern that moving work behind a login screen, even if creating an account was free, ran counter to their collection’s mission to make their collections broadly available online. Their goal was to create an accessible collection, and adding barriers made that collection less available.

More generally, this would be a terrible move for the open Web, which has at its heart the frictionless access to knowledge. Locking things down simply to keep out the AI bots would go against that core philosophy completely. It would also bolster arguments frequently made by the copyright industry that access to everything online should by default require permission.

It seems unfair that groups working for the common good are forced by the onslaught of AI bots to carry out extra work constantly re-configuring firewalls, to pay for extra services, or to undermine the openness that lies at the heart of their missions. An article on the University of North Carolina Web site discussing how the university’s library tackled this problem of AI bots describes an interesting alternative approach that could offer a general solution. Faced with a changing pattern of access by huge numbers of AI bots, the library brought in local tech experts:

[Associate University Librarian for Digital Strategies & Information Technology] Shearer turned to the University’s Information Technology Services, which serves the entire campus. They had never encountered an attack quite like this either, and they readily brought their security and networking teams to the table. By mid-January a powerful AI-based firewall was in place, blocking the bots while permitting legitimate searches.

Stopping just the AI bots requires spotting patterns in access traffic that distinguish them from human visitors in order to allow the latter to continue with their visits unimpeded. Finding patterns quickly in large quantities of data is something that modern AI is good at, so using it to filter out the constantly shifting patterns of AI bot access by tweaking the site’s firewall rules in real time is an effective solution. It’s also an apt one: it means that the problems that AI is creating can be solved by AI itself.

Such an AI-driven firewall management system needs to be created and updated to keep ahead of the rapidly-evolving AI bot landscape. It would make a great open source project that coders and non-profits around the world could work on together, since the latter face a common problem, and many have too few resources to do it on their own. Open source applications of the latest AI technologies are rather thin on the ground, even if most generative AI systems are based on open source code. An AI-driven firewall management system optimized for the GLAM sector would be a great place for the free software world to start remedying that.
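For a sense of what the firewall rules mentioned earlier look like at their simplest, here is a minimal sketch of blocking by IP address range using Python's standard `ipaddress` module. It is not the system UNC deployed, and nothing here is AI-driven: the point is only to show the kind of rule list that an automated system would be adding to and pruning. The blocked ranges are reserved documentation addresses chosen purely for illustration.

```python
import ipaddress

# Hypothetical block list. These ranges are reserved for documentation
# and stand in for whatever ranges a site operator decides to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(is_blocked("203.0.113.42"))    # inside the first range
print(is_blocked("198.51.100.200"))  # outside the /25, so allowed
print(is_blocked("192.0.2.1"))       # not listed at all
```

A static list like this is exactly what has to be re-configured constantly by hand, which is the chore an automated rule manager would take over.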

Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.

Filed Under: ai, bots, filters, firewalls, open web

21 Comments

AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk

(Mis)Uses of Technology

from the externalizing-your-costs-directly-into-my-face dept

Thu, Apr 10th 2025 01:02pm - Glyn Moody

The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.

Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.

These are the open source projects that have a Web presence with a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew Devault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:

These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

Devault says that he knows many other Web sites are similarly affected:

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.
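Devault's description of the traffic suggests why a common defense, per-IP rate limiting, struggles here. The following is a minimal sketch, not SourceHut's actual tooling: it counts how many requests a fixed per-address limit would reject, first for a crawler using one address and then for the same volume spread across many addresses. The addresses, the request counts and the limit of 5 are all made up for the illustration.

```python
from collections import Counter
from typing import Iterable


def blocked_requests(request_ips: Iterable[str], limit_per_ip: int) -> int:
    """Count the requests a fixed per-IP limit would reject.

    Each address is allowed limit_per_ip requests; every request
    beyond that from the same address is counted as blocked.
    """
    seen: Counter = Counter()
    blocked = 0
    for ip in request_ips:
        seen[ip] += 1
        if seen[ip] > limit_per_ip:
            blocked += 1
    return blocked


# Hypothetical traffic: 1,000 requests from a single address...
single_source = ["192.0.2.10"] * 1000
# ...versus 1,000 requests from 1,000 different placeholder addresses.
distributed = [f"10.{i // 256}.{i % 256}.1" for i in range(1000)]

print(blocked_requests(single_source, limit_per_ip=5))
print(blocked_requests(distributed, limit_per_ip=5))
```

With one request per address, the limiter never sees any address exceed its allowance, so the distributed crawl passes through untouched while placing the same total load on the server.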

The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring and fine-tuning them requires time and energy from those running the sites, time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:

Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.

It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.

Follow me @glynmoody on Mastodon and on Bluesky.

Filed Under: access to knowledge, ai, apis, bandwidth, bots, datacenter, drew devault, licensing, llms, open source, open web, publishers, scraping, sysadmins, training data, web crawlers, wikimedia
Companies: sourcehut

24 Comments

Decentralized Systems Will Be Necessary To Stop Google From Putting The Web Into Managed Decline

Predictions

from the it's-up-to-us dept

Tue, May 21st 2024 09:34am - Mike Masnick

Is Google signaling the end of the open web? That’s some of the concern raised by its new embrace of AI. While most of the fears about AI may be overblown, this one could be legit. But it doesn’t mean that we need to accept it.

These days, there is certainly a lot of hype and nonsense about artificial intelligence and the ways that it can impact all kinds of industries and businesses. Last week at Google I/O, Google made it clear that it’s moving forward with what it calls “AI overviews,” in which Google’s own Gemini AI tech will try to generate answers at the top of search pages.

All week I’ve been hearing people fretting about this, sharing some statement similar to Kevin Roose at the NY Times asking if the open web can survive such a thing.

In the early days, Google’s entire mission was to get you off their site as quickly as possible. In a 2004 interview with Playboy magazine that was later immortalized in a regulatory filing with the SEC (due to concerns of them violating quiet period restrictions), Larry Page famously made clear that their goal was to quickly help you find what you want and send you on your way:

PLAYBOY: With the addition of e-mail, Froogle—your new shopping site—and Google news, plus your search engine, will Google become a portal similar to Yahoo, AOL or MSN? Many Internet companies were founded as portals. It was assumed that the more services you provided, the longer people would stay on your website and the more revenue you could generate from advertising and pay services.

PAGE: We built a business on the opposite message. We want you to come to Google and quickly find what you want. Then we’re happy to send you to the other sites. In fact, that’s the point. The portal strategy tries to own all of the information.

PLAYBOY: Portals attempt to create what they call sticky content to keep a user as long as possible.

PAGE: That’s the problem. Most portals show their own content above content elsewhere on the web. We feel that’s a conflict of interest, analogous to taking money for search results. Their search engine doesn’t necessarily provide the best results; it provides the portal’s results. Google conscientiously tries to stay away from that. We want to get you out of Google and to the right place as fast as possible. It’s a very different model.

PLAYBOY: Until you launched news, Gmail, Froogle and similar services.

PAGE: These are just other technologies to help you use the web. They’re an alternative, hopefully a good one. But we continue to point users to the best websites and try to do whatever is in their best interest. With news, we’re not buying information and then pointing users to information we own. We collect many news sources, list them and point the user to other websites. Gmail is just a good mail program with lots of storage.

Ah, how times have changed. And, of course, there is an argument that if you’re just looking for an answer to a question, giving you that answer directly can and should be more efficient, rather than pointing you to a list of places that might (or might not) have that answer.

But, not everything that people are searching for is just “an answer.” And not everything that is an answer takes into account the details, nuances, and complexities of whatever topic someone might be searching on.

There’s nothing inherent to the internet that makes the “search to get linked somewhere else” model have to make sense. Historically, that’s how things have been done. But if you could have an automated system simply give you directly what you needed at the right time, that would probably be a better solution for some subset of issues. And, if Google doesn’t do it, someone else will, and that would undermine Google’s market.

But still, it sucks.

Google’s search has increasingly become terrible. And it appears that much of that enshittification is due to (what else?) an effort to squeeze more money out of everyone, rather than providing a better service.

In Casey Newton’s writeup of the new “AI Overviews” feature, he notes that it may be a sign that “the web as we know it is entering a kind of managed decline.”

Still, as the first day of I/O wound down, it was hard to escape the feeling that the web as we know it is entering a kind of managed decline. Over the past two and a half decades, Google extended itself into so many different parts of the web that it became synonymous with it. And now that LLMs promise to let users understand all that the web contains in real time, Google at last has what it needs to finish the job: replacing the web, in so many of the ways that matter, with itself.

I had actually read this article the day it came out, but I didn’t think too much of that paragraph until a couple days later at a dinner full of folks working on decentralization. Someone brought up that quote, though paraphrased it slightly differently, claiming Casey was saying that Google was actively putting the web into managed decline.

Whether or not that’s very different (and maybe it’s not), both should spark people to realize that this is a problem.

And it’s one of the reasons I am still hoping that people will spend more time thinking about solutions that involve decentralization. Not necessarily because of “search” (which tends to be more of a centralized tool by necessity), but because the world of decentralized social media could offer an alternative to the world in which all the information we consume is intermediated by a single centralized player, whether it’s a search engine like Google, or a social media service like Meta.

For the last few years, there have been stories trying to remind people that Facebook is not the internet. But that’s because, for some people, it kinda has been. And the same is true of Google. For some people, their online worlds exist either in social media or in search as the mediating forces in their lives. And, obviously, there are all sorts of reasons why that happens, but it should be seen as a much less fulfilling kind of internet.

The situation discussed here, where Google is trying to give people full answers via AI, rather than sending them elsewhere on the web, may well be “putting the web into managed decline,” but there’s no reason we have to accept that future.

The various decentralized social media systems that have been growing over the past few years offer a very different potential approach: one in which you get to build the experience you want, rather than the one a giant company wants. If you need information, others on the decentralized social network can help you find it or respond to your questions.

It’s a much more social experience, mediated by other people, perhaps on different systems, rather than a single giant company determining what you get to see.

The promise of the internet, and the World Wide Web in particular, was that anyone could build their own world there, connected with others. It was a world that wasn’t supposed to be in any kind of walled garden. But, many people have ended up in just a few of those walled gardens.

It’s no secret why: they do what they do pretty damn well, and certainly better than what was around before. People became reliant on Google search because it was much better. They became reliant on Facebook because it was an easy way to keep up with your family and friends. But in giving those companies so much control, we’ve lost some of that promise of the open web.

And now we can take it back. Whether it’s using ActivityPub/Mastodon, or Bluesky/ATProtocol (or others like nostr or Farcaster), we’re starting to see users building out an alternative vision that isn’t just mediated by single companies with Wall Street demands pushing them to enshittify.

No one’s saying to give up using Google, because it’s necessary for many. But start to think about where you spend your time online, and who is looking to lock you in vs. who is giving you more freedom to have the world that works best for you.

Filed Under: ai, decentralization, managed decline, open web, search
Companies: google

31 Comments


Meta Begins The Process Of Ending News Links In Canada

Journalism

from the the-end-of-the-news dept

Wed, Aug 2nd 2023 12:08pm - Mike Masnick

This is not a surprise, because the company made it clear it planned to do exactly this, but Meta has now begun the process of stopping links to news sources from appearing in Canada, something that Canadian Heritage Minister Pablo Rodriguez insisted would never happen. The company says it will take a few weeks to roll out fully, but in the meantime, Meta explains what this will actually look like.

For Canadian news outlets this means:

News links and content posted by news publishers and broadcasters in Canada will no longer be viewable by people in Canada. We are identifying news outlets based on legislative definitions and guidance from the Online News Act.

For international news outlets this means:

News publishers and broadcasters outside of Canada will continue to be able to post news links and content, however, that content will not be viewable by people in Canada.

For our Canadian community this means:

People in Canada will no longer be able to view or share news content on Facebook and Instagram, including news articles and audio-visual content posted by news outlets.

For our international community this means:

There is no change to our services for people accessing our technologies outside of Canada.

The details mention Facebook and Instagram, though it’s not clear if Threads is included as well. Perhaps as a subset of Instagram it is, but that also might damage Threads’ viability even more.

This is disappointing in all sorts of ways. Not being able to post, view, or discuss news is not a great result, obviously. I especially feel bad for the media orgs who bet big on Facebook as a delivery channel, who are hurt by this (as many people know, Techdirt basically ignored Facebook other than setting up an auto-posting system, and while others mocked us for this decision, in the long run, I still stand by it).

But the blame for this disappointing result needs to go fully on the Canadian government. This law is bad. The entire structure of it is an attack on the open web, suggesting that governments can force some companies to pay other companies for sending them traffic. That makes no sense in any world.

Throughout this process, the media orgs that supported this bill, and the politicians behind it as well, have vastly (embarrassingly) overestimated the importance and value of news to Facebook and Google. Even their talking points, suggesting that these companies were “profiting unfairly” off of news, just never made any sense if you had any idea how any of this actually works. Google and Facebook make very little money off of news links. At best, they served as a way to get some users to spend a bit more time coming to their platforms as part of their feed, but it was never a central part, nor particularly valuable.

I did, however, want to respond to a few comments (often screamed at me on Twitter) directed at me regarding my opposition to these laws. There’s this weird, dangerous belief that because these laws “tax” Facebook and Google, and lots of people (reasonably!) dislike Facebook and Google, they must be good laws. And, relatedly, they claim that anyone who doesn’t support these laws must be doing so in support of Facebook or Google.

But, that’s both silly and shortsighted. I’d be happy to see both Meta and Google cut down to size, and have said so for years, and have even suggested many ways of making that a reality. But these kinds of laws are dangerous, on principle, in taxing something that makes no sense to tax, forcing payments for something that should be fundamentally free, and undermining the basic structure of the open web.

But, worse, they represent an acceptance of a fundamentally corrupt principle that will undoubtedly be abused to much greater lengths going forward.

Establishing the principle that the government can look at one industry and force another industry to pay it is a recipe for very dangerous corruption. That’s doubly true when, as in this case, we’re talking about one industry that mostly failed to innovate, rested on its cash cow laurels, and spent years mocking the innovation occurring around it. And then going after the industry that did innovate, that built products and services that customers actually used, with better business models, and basically telling them they have to cough up cash for the industry that failed to do that?

That creates incredibly skewed incentives for literally everyone involved. It creates terrible incentives for legacy industries. Terrible incentives for innovative industries. Terrible incentives for politicians. It’s a lose-lose-lose proposition.

Am I concerned about the plight of media today? Absolutely (I mean, for fuck’s sake, I run a media site!). Am I concerned that Google and Meta are too powerful, and prone to abusing that power? Absolutely. That’s why I constantly push for plans that lessen their power and move people to alternative approaches.

But you have to do it in a way that doesn’t fundamentally mess up literally everyone’s incentives in a manner that isn’t just obviously corrupt, but so blatantly so that it diminishes everyone’s trust in our institutions.

Filed Under: c-18, canada, journalism, link tax, linking, news, open web
Companies: meta

24 Comments


EU Gives Up On The Open Web Experiment, Decides It Will Be The Licensed Web Going Forward

from the this-is-bad dept

Wed, Sep 12th 2018 09:35am - Mike Masnick

Well, this was not entirely unexpected at this point, but in the EU Parliament earlier today, they voted to end the open web and move to a future of a licensed-only web. It is not final yet, as the adopted version by the EU Parliament is different than the (even worse) version that was agreed to by the EU Council. The two will now need to iron out the differences and then there will be a final vote on whatever awful consolidated version they eventually come up with. There will be plenty to say on this in the coming weeks, months and years, but let’s just summarize what has happened.

For nearly two decades, the legacy entertainment industries have always hated the nature of the open web. Their entire business models were based on being gatekeepers, and a “broadcast” world in which everything was licensed and curated was perfect for that. It allowed the gatekeepers to have ultimate control — and with it the power to extract massive rents from actual creators (including taking control over their copyright). The open web changed much of that. By allowing anyone to publicize, distribute and sell works by themselves, directly to end users, the middlemen were no longer important.

The fundamental nature of the internet was that it was a communications medium rather than a broadcast medium, and as such it allowed for permissionless distribution of content and communication. This has always infuriated the legacy gatekeepers as it completely undermined the control and leverage they had over the market. If you look back at nearly every legal move by these gatekeepers over the last twenty-five years concerning the internet, it has always been about trying to move the internet away from an open, permissionless system back towards one that was a closed, licensed, broadcast, curated one. There’s historical precedent for this as well. It’s the same thing that happened to radio a century ago.

For the most part, the old gatekeepers have not been able to succeed, but that changed today. The proposal adopted by the EU Parliament makes a major move towards ending the open web in the EU and moving to a licensed, curated one, which will limit innovation, harm creators, and only serve to empower the largest internet platforms and some legacy gatekeepers. As Julia Reda notes:

Today’s decision is a severe blow to the free and open internet. By endorsing new legal and technical limits on what we can post and share online, the European Parliament is putting corporate profits over freedom of speech and abandoning long-standing principles that made the internet what it is today.

The Parliament’s version of Article 13 (366 for, 297 against) seeks to make all but the smallest internet platforms liable for any copyright infringements committed by their users. This law leaves sites and apps no choice but to install error-prone upload filters. Anything we want to publish will need to first be approved by these filters, and perfectly legal content like parodies and memes will be caught in the crosshairs.

The adopted version of Article 11 (393 for, 279 against) allows only “individual words” of news articles to be reproduced for free, including in hyperlinks – closely following an existing German law. Five years after the “link tax” came into force in Germany, no journalist or publisher has made an extra penny, startups in the news sector have had to shut down and courts have yet to clear up the legal uncertainty on exactly where to draw the line. The same quagmire will now repeat at the EU level – no argument has been made why it wouldn’t, apart from wishful thinking.

This is a dark day for the open internet in the EU… and around the world. Expect the same gatekeepers to use this move by the EU to put pressure on the US and lots of other countries around the world to “harmonize” and adopt similar standards in trade agreements.

I know that many authors, musicians, journalists and other content creators cheered this on, incorrectly thinking that it was a blow to Google and would magically benefit them. But they should recognize just what they’ve supported. It is not a bill designed to help creators. It is a bill designed to prevent innovation, lock up paths for content creators to have alternatives, and force them back into the greedy, open arms of giant gatekeepers.

Filed Under: article 11, article 13, broadcast, copyright, eu copyright directive, gatekeepers, internet, license, link tax, open web, permission, upload filters

153 Comments


Google Ideas Boss's Really Bad Idea: Kick ISIS Off The Open Web

Say That Again

from the good-luck-with-that dept

Thu, Jan 21st 2016 08:33am - Mike Masnick

Over the last few weeks, there’s been increasing focus on what “else” Silicon Valley can do in the fight against ISIS. Backdooring encryption is a dumb idea that won’t work and will make everyone less safe. So, a second idea keeps getting floated: what if we just stopped letting ISIS use the internet? Hell, both Hillary Clinton and Donald Trump supported the idea recently. And then you have some wacky law professors suggesting the same thing.

For the most part, cooler heads in the tech industry have pointed out that (1) this is impossible and (2) any attempt to do so would be counterproductive in just encouraging more activity and (3) it would actually undermine intelligence gathering, as public posts to social media are a key source of useful intelligence these days.

But, now, at least one prominent person within the tech industry has jumped on board: Jared Cohen, the somewhat controversial head of Google Ideas (which, for whatever it’s worth, isn’t “Google”), who used to work for the State Department. Cohen gave a talk in the UK in which he argued that ISIS was too good at propaganda on the internet, so the answer is to wipe them off the open internet and leave them shuffling around the dark web instead.

Jared Cohen, the director of Google Ideas, believes that to “recapture digital territory” from the terror cell, its members must fear being caught when they post messages promoting the organisation’s cause in public.

“Terrorist groups like Isis, they operate in the dark web whether we want them to or not,” Cohen said at a talk on Waging a Digital Counterinsurgency, at Chatham House. “What is new is that they’re operating without being pushed back in the same internet we all enjoyed. So success looks like Isis being contained to the dark web”.

This is, as noted above, both silly and wrong. First of all, it’s impossible. It’s a ridiculous task that will waste a ton of time, won’t accomplish anything really useful, and will likely result in too many false positives, including (most likely) those who are monitoring and combating ISIS. Second, as mentioned, it will actually do a tremendous amount to limit the intelligence community’s ability to monitor and track ISIS. It’s funny that on the one hand we have officials demanding an end to encrypted communications, fearing “going dark,” while many of those same individuals then turn around and talk about taking ISIS off the public internet, where they reveal a ton of useful information about their activities. Third, it raises serious questions about how committed companies like Google really are to the open internet. Yes, Cohen is director of “Google Ideas” which is separate from Google itself, but basically all of the press coverage about this says that Google is saying people should be kicked off the open web. That’s messaging that will come back to haunt Google as it pushes for the open web in other contexts. Cohen has just opened up Google to a major attack on key points it’s pushing for everywhere else.

On top of that, Cohen seems to think that losing their Twitter accounts will be seen as some kind of punishment:

To do this Cohen said that Isis members openly promoting their cause online must fear retribution and being caught for their actions. Their social media accounts must be removed as fast as they are produced to prevent people making contact with Isis recruiters on the open web.

But that appears to be somewhat ignorant of how things are currently working. Many of their social media accounts are being removed rapidly and to ISIS supporters it becomes a badge of honor, as they quickly open a new account. It’s not retribution, it becomes validation.

It’s too bad that Cohen would suggest such a short-sighted concept when there’s so much evidence these days of how completely counterproductive it would be. This isn’t the kind of creative or new thinking that was promised from Google Ideas, it’s traditional silly Washington DC thinking, without any recognition of the reality of the technology world. If this is a concept from Google Ideas, let’s just say it’s a really, really bad idea. Maybe Google needs a department of better ideas.

Update: And I missed the biggest laugh of all. I hadn’t even realized that the supposed “mission” of Google Ideas is: “Google Ideas builds products to support free expression and access to information for people who need it most.” Hard to see how blocking people from using the internet fits within that purview.

Filed Under: free speech, google ideas, isis, jared cohen, open web, social media
Companies: google, google ideas

32 Comments

