California’s A.B. 412: A Bill That Could Crush Startups And Cement A Big Tech AI Monopoly
from the this-is-not-the-solution-you're-looking-for dept
California legislators have begun debating a bill (A.B. 412) that would require AI developers to track and disclose every registered copyrighted work used in AI training. At first glance, this might sound like a reasonable step toward transparency. But it’s an impossible standard that could crush small AI startups and developers while giving big tech firms even more power.
A Burden That Small Developers Can’t Bear
The AI landscape is in danger of being dominated by large companies with deep pockets. These big names are in the news almost daily. But they’re far from the only ones – there are dozens of AI companies with fewer than 10 employees trying to build something new in a particular niche.
This bill demands that creators of any AI model—even a two-person company or a hobbyist tinkering with a small software build—identify copyrighted materials used in training. That requirement will be incredibly onerous, even if limited just to works registered with the U.S. Copyright Office. The registration system is a cumbersome beast at best—neither machine-readable nor accessible, it’s more like a card catalog than a database—that doesn’t offer information sufficient to identify all authors of a work, much less help developers to reliably match works in a training set to works in the system.
Even for major tech companies, meeting these new obligations would be a daunting task. For a small startup, throwing on such an impossible requirement could be a death sentence. If A.B. 412 becomes law, these smaller players will be forced to devote scarce resources to an unworkable compliance regime instead of focusing on development and innovation. The risk of lawsuits—potentially from copyright trolls—would discourage new startups from even attempting to enter the field.
A.I. Training Is Like Reading And It’s Very Likely Fair Use
A.B. 412 starts from a premise that’s both untrue and harmful to the public interest: that reading, scraping or searching of open web content shouldn’t be allowed without payment. In reality, courts should, and we believe will, find that the great majority of this activity is fair use.
It’s now a bedrock principle of internet law that some forms of copying content online are transformative, and thus legal fair use. That includes reproducing thumbnail images for image search, or snippets of text to search books.
The U.S. copyright system is meant to balance innovation with creator rights, and courts are still working through how copyright applies to AI training. In most of the AI cases, courts have yet to consider—let alone decide—how fair use applies. A.B. 412 jumps the gun, preempting this process and imposing a vague, overly broad standard that will do more harm than good.
Importantly, those key court cases are all federal. The U.S. Constitution makes it clear that copyright is governed by federal law, and A.B. 412 improperly attempts to impose state-level copyright regulations on an issue still in flux.
A.B. 412 Is A Gift to Big Tech
The irony of A.B. 412 is that it won’t stop AI development—it will simply consolidate it in the hands of the largest corporations. Big tech firms already have the resources to navigate complex legal and regulatory environments, and they can afford to comply (or at least appear to comply) with A.B. 412’s burdensome requirements. Small developers, on the other hand, will either be forced out of the market or driven into partnerships where they lose their independence. The result will be less competition, fewer innovations, and a tech landscape even more dominated by a handful of massive companies.
If lawmakers are able to iron out some of the practical problems with A.B. 412 and pass some version of it, they may be able to force programmers to research—and effectively, pay off—copyright owners before they even write a line of code. If that’s the outcome in California, Big Tech will not despair. They’ll celebrate. Only a few companies own large content libraries or can afford to license enough material to build a deep learning model. The possibilities for startups and small programmers will be so meager, and competition will be so limited, that profits for big incumbent companies will be locked in for a generation.
If you are a California resident and want to speak out about A.B. 412, you can find and contact your legislators through this website.
Originally published to the EFF’s Deeplinks blog.
Filed Under: ab 412, ai, ai training data, california, competition, copyright, fair use


Comments on “California’s A.B. 412: A Bill That Could Crush Startups And Cement A Big Tech AI Monopoly”
Another day, another impossible-to-comply-with tech regulation bill. Yikes, we really need more tech-literate politicians.
“Won’t somebody please think of the small developers who want to shove AI slop into your faces whether you like it or not?” – EFF
The EFF going to bat for AI is just as embarrassing as when they shilled for NFTs and Cryptocurrency.
Re:
Oh shut your trap, the GOP are dismantling the government and nanny-state and anti-porn lawmakers are coming together to kill and censor the internet.
Stop attacking the people fighting for you, dipshit.
Re: Re:
Crypto companies donated hundreds of millions to fund what is happening to the US government now, because they stand to benefit in the short term from ending all investigations into their illegal activities and from the unraveling of the rule of law. If you’re fighting for them, and for the (often the same) people looking to loot the open internet – people who continually hammer sites for content regardless of what the users and owners of those sites want, actively ignoring robots.txt to scrape every last scrap of data which they will then claim ownership of when it’s convenient, regardless of the costs to the actual little guy – you sure as shit aren’t fighting for the little guy.
Copyright owners being assholes doesn’t make AI companies good guys, just the same as banks being assholes doesn’t make the people running Coinbase and Tether in any way good.
AB 412
Isn’t this a good argument that the nation should provide a content and copyright database that can be accessed? Large companies can select all the content and pay compensation, and small companies can filter for uncopyrighted/license-free content, perhaps with data/tags about the content details.
Obviously, this will not work if the system can be gamed, for example if it only applies to companies resident in a single state, so it should be built at the federal level. This would be public infrastructure, paid for out of federal taxes, that would allow all companies engaged in AI to use content for training at a price they can afford. Uncertainty about liability would be [largely] solved. I appreciate there would be grifters trying to subvert the system, like we see with patents, but at least small companies would be sure that they are using the content they want without being potentially liable. For the creators, there would be a database that could be checked for a work’s presence and use.
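To make the idea concrete: assuming such a federal registry existed and exposed license metadata and content tags, a small company’s filtering step might look like this minimal sketch. Everything here (the `Registry` class, the record fields, the license labels) is hypothetical; no such database exists today.

```python
# Hypothetical sketch of querying a proposed public content/copyright
# registry. All names, fields, and license labels are assumptions for
# illustration only.
from dataclasses import dataclass


@dataclass
class Record:
    work_id: str
    title: str
    license: str   # e.g. "public-domain", "cc0", "all-rights-reserved"
    tags: tuple    # descriptive metadata about the content


class Registry:
    """Toy in-memory stand-in for the proposed federal database."""

    def __init__(self, records):
        self._records = records

    def search(self, license_ok=("public-domain", "cc0"), tag=None):
        # Small companies filter for uncopyrighted/license-free works;
        # an optional tag narrows results to a particular niche.
        for r in self._records:
            if r.license in license_ok and (tag is None or tag in r.tags):
                yield r


records = [
    Record("w1", "Moby-Dick", "public-domain", ("fiction", "novel")),
    Record("w2", "Some Modern Novel", "all-rights-reserved", ("fiction",)),
    Record("w3", "Open Field Notes", "cc0", ("science",)),
]

registry = Registry(records)
free_fiction = [r.title for r in registry.search(tag="fiction")]
print(free_fiction)  # → ['Moby-Dick']
```

The point of the sketch is the design shape, not the implementation: license status and tags live in one authoritative place, so a two-person shop can exclude all-rights-reserved works up front instead of guessing at liability afterward.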
The hard problem will be voice content as I suspect companies like Telcos and Amazon will be taking voice content provided by devices without opt-in/opt-out options. The upcoming AMZN Alexa upgrade seems to indicate that they want to do exactly that – better features in exchange for the unlimited, uncompensated use of your voice.
Re:
Yep. It levels the playing field without hurting creators, and the government is well suited to this sort of public good.
If anything, it would be a better solution, since it also levels the disadvantage of needing to invest capital into scraping everything yourself, which is not cheap regardless of regulation (never mind the load sites are facing, with multiple companies hammering them for data).
It also potentially paves the way to solving future problems. A lot of companies are running into an issue where they’ve already scraped everything on the internet and need synthetic data. It would also allow charging/taxing companies based on their size/profits.
Re: Re:
🤣🤣🤣
Re: Re: Re:
“The government” in the abstract, not this particular administration.
Re:
The hard problem will be voice content as I suspect companies like Telcos and Amazon will be taking voice content provided by devices without opt-in/opt-out options.
That’s not a hard problem at all in this context: to the extent copyright on such voice content exists, it is owned by the person who fixed it in a tangible medium. That is, Amazon. Speaking does not create any copyright interest; recording does.
You may possibly consider it a hard problem from a privacy standpoint (though I’d argue that is also an extremely easy problem), but that’s an entirely different area.
It makes life difficult for some Bullshit Generator vendors? Sounds good to me. We can hunt the big game later.
Re:
No. It locks in internet giants and harms startups.
I recognize there’s a very silly class of people who think all LLM related products are junk, but that’s nonsense.
Re: Re:
Giving internet giants what they are actively pushing for is a really weird way to go about fixing that, if that’s your concern.
Re:
Since meme generators (among other kinds) are actually fun, I have to wonder if the next thing you’re going to attack is children’s parties.
Way to help Foreign AI startups innovate
If the innovative startups can’t do it here then they’ll do it elsewhere.
OK, I will agree with a “regulation will lock in the major players” take in many areas, but here it’s like saying California is going to prohibit artisanal, hand-sliced subprime mortgages.
This is what I was talking about last month:
“When the dust clears, there will probably be some kind of compromise where large corporations that are afraid of AI will be paid off or partnered up with AI companies. This might result from lawsuits or mergers or buyouts. If courts rule against LLM training on copyrighted content, you’ll just see investment war chests opened up for licensing from large copyright holders. The little guy will still get screwed, but some greedy publisher will take a payoff.”
dont call it AI tho
I think we should be careful to distinguish between AI and LLMs. Deep Blue was an AI and AFAIK didn’t need copyrighted input. ChatGPT, OTOH, is just an LLM, not an AI.
Re:
Depends on how you define “copyrighted input.”
The underlying collection of facts within chess databases is not itself copyrightable, so in that sense, no, it didn’t require copyrighted input.
On the other hand, creating such a database requires obtaining video or written records of all the chess games in question, which are automatically copyrighted like everything else, and therefore it requires copyrighted inputs.
Re: Re: that’s not how they did it
Most of Deep Blue’s database was self-created by running billions of chess games against itself.
So... Time to Relocate?
So don’t locate your business in California?
No. It’s really not. The constitution says it’s ALL about innovation. The constitution permits a long-term game where short-term innovation is, slightly, depressed, but in exchange for much more innovation[0]. The US constitution does NOT recognize any inherent “creator rights” in any way, shape, or form. It ALLOWS the issuance of extra, time-limited privileges (rights) as an incentive for more creativity, but that’s it.
It’s right there at the start of the sentence. The constitution is interested in innovation, and allows the use of creator rights to achieve that end, but it doesn’t require them. The constitution places zero inherent value or merit on creator rights.
[0] It’s worth noting that the current implementation of US copyright is inconsistent with what the constitution intends or allows. But even so, the current system is still not a “balance.”
Banning the home production of meth is a gift to Big Pharma!
Re:
Wow. Deeply insightful analogy.
The first half (documenting what the training data is) doesn’t seem too bad or hard to do (storage space is getting cheaper all the time). It’s the half not mentioned directly in the article that’s problematic: specifically, the part where the AI owner has to tell any given rando what data of theirs was used.