The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:
We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
Specifically:
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:
While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.
These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
DeVault says that he knows many other Web sites are similarly affected:
All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.
The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring and fine-tuning them requires time and energy from those running the sites – time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.
An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:
Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.
It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.
One of the many interesting aspects of the current enthusiasm for generative AI is the way that it has electrified the formerly rather sleepy world of copyright. Where before publishers thought they had successfully locked down more or less everything digital with copyright, they now find themselves confronted with deep-pocketed companies – both established ones like Google and Microsoft, and newer ones like OpenAI – that want to overturn the previous norms of using copyright material. In particular, the latter group want to train their AI systems on huge quantities of text, images, videos and sounds.
As Walled Culture has reported, this has led to a spate of lawsuits from the copyright world, desperate to retain their control over digital material. They have framed this as an act of solidarity with the poor exploited creators. It’s a shrewd move, and one that seems to be gaining traction. Lots of writers and artists think they are being robbed of something by Big AI, even though that view is based on a misunderstanding of how generative AI works. However, in the light of stories like one in The Bookseller, they might want to reconsider their views about who exactly is being evil here:
Academic publisher Wiley has revealed it is set to make $44 million (£33 million) from Artificial Intelligence (AI) partnerships that it is not giving authors the opportunity to opt-out from.
As to whether authors would share in that bounty:
A spokesperson confirmed that Wiley authors are set to receive remuneration for the licensing of their work based on their “contractual terms”.
That might mean they get nothing, if there is no explicit clause in their contract about sharing AI licensing income. For example, here’s what is happening with the publisher Taylor & Francis:
In July, authors hit out at another academic publisher, Taylor & Francis, the parent company of Routledge, over an AI deal with Microsoft worth $10 million, claiming they were not given the opportunity to opt out and are receiving no extra payment for the use of their research by the tech company. T&F later confirmed it was set to make $75 million from two AI partnership deals.
It’s not just in the world of academic publishing that deals are being struck. Back in July, Forbes reported on a “flurry of AI licensing activity”:
The most active area for individual deals right now by far—judging from publicly known deals—is news and journalism. Over the past year, organizations including Vox Media (parent of New York magazine, The Verge, and Eater), News Corp (Wall Street Journal, New York Post, The Times (London)), Dotdash Meredith (People, Entertainment Weekly, InStyle), Time, The Atlantic, Financial Times, and European giants such as Le Monde of France, Axel Springer of Germany, and Prisa Media of Spain have each made licensing deals with OpenAI.
In the absence of any public promises to pass on some of the money these licensing deals will bring, it is not unreasonable to assume that journalists won’t be seeing much if any of it, just as they aren’t seeing much from the link tax.
The increasing number of such licensing deals between publishers and AI companies shows that the former aren’t really too worried about the latter ingesting huge quantities of material for training their AI systems, provided they get paid. And the fact that there is no sign of this money being passed on in its entirety to the people who actually created that material also confirms that publishers don’t really care about creators. In other words, it’s pretty much what was the status quo before generative AI came along. For doing nothing, the intermediaries are extracting money from the digital giants by invoking the creators and their copyrights. Those creators do all the work, but once again see little to no benefit from the deals that are being signed behind closed doors.
To be honest, I’m somewhat amazed that more copyright lawsuits haven’t been filed against Twitter yet. There have been multiple reports of how the company’s DMCA takedown response systems have been broken/ignored since Musk took over. Without looking for it, I’ve seen full length high def movies show up in my Twitter feed (including movies still in theaters).
Still, it’s a bit surprising that the first such lawsuit is not from a Hollywood studio, but rather a big giant list of music publishers. And I’m pretty sure that Twitter has a strong case, if Elon bothers to hire competent copyright attorneys.
The backstory here is that music publishers (who are different than the record labels, even if some are connected to labels) have been demanding that Twitter license content for years. And, for years, Twitter correctly pointed out that it abides by the DMCA, and takes down copyright-infringing works when it receives a proper takedown notice. This is exactly what the law allows them to do, and it’s not as if Twitter is where people go to listen to music (and what music does get posted is generally hosted elsewhere and posted in a promotional manner). So, really, the idea that Twitter had to get a license from the publishers was always a stretch.
Still, almost immediately after Elon announced his bid for Twitter, the music publishers started agitating for him to license compositions. But, this is Elon Musk we’re talking about. The man won’t even pay his rent, or his cloud computing bills. Did anyone actually think he would pay for publisher licenses he doesn’t even need? So, it was little surprise when there were reports earlier this year that the talks had “stalled.”
And now there’s a lawsuit. But it doesn’t seem like a particularly strong one:
This is a civil action seeking damages and injunctive relief for Twitter’s willful copyright infringement. Twitter fuels its business with countless infringing copies of musical compositions, violating Publishers’ and others’ exclusive rights under copyright law. While numerous Twitter competitors recognize the need for proper licenses and agreements for the use of musical compositions on their platforms, Twitter does not, and instead breeds massive copyright infringement that harms music creators.
I mean, first of all… what? I’ve been an avid Twitter user from 2008 through 2022 and I honestly can’t recall ever encountering music in any significant way, or if I did, it was links to licensed sources such as Spotify, Apple Music, YouTube or whatever.
The only reason to do such a license is if you’re actually hosting music (and even then the DMCA should protect you, but most sites choose to get a license mainly to get the industry to stop constantly screaming at them and so that they don’t have to constantly play DMCA takedown whac-a-mole).
And, some of this is just nonsense:
Twitter knows perfectly well that neither it nor users of the Twitter platform have secured licenses for the rampant use of music being made on its platform as complained of herein. Nonetheless, in connection with its highly interactive platform, Twitter consistently and knowingly hosts and streams infringing copies of musical compositions, including ones uploaded by or streamed to Tennessee residents and including specific infringing material that Twitter knows is infringing. Twitter also routinely continues to provide specific known repeat infringers with use of the Twitter platform, which they use for more infringement.
The standard here has to be specific, actual knowledge of infringing works, not general knowledge that some people on the platform sometimes post infringing works. And while the paragraph above alleges “specific infringing material that Twitter knows is infringing,” it’s not actually that simple. That’s the same sort of argument that Viacom made against YouTube and failed with. In that case, Viacom also insisted that YouTube had to know these works were infringing and the court said that’s not how it works. And it’s even more limited in this case because the publishers say that Twitter “knows” that its “users” have not secured licenses, but does not suggest how they know this at all. It’s entirely possible that some of the users have, in fact, secured licenses. Or, as noted, that they’re just posting videos from elsewhere that are licensed. The publishers know this, so this is just misleading nonsense.
Twitter profits handsomely from its infringement of Publishers’ repertoires of musical compositions. The audio and audio-visual recordings embodying those compositions attract and retain users (both account holders and visitors) and drive engagement, thereby furthering Twitter’s lucrative advertising business and other revenue streams.
I doubt this very much. First, again, who goes to Twitter for the music? Second, (also, again) the vast majority of music is linked to on other sites, not hosted by Twitter. Yes, Twitter hosts some video, and yes, Elon expanded how much can be posted, but it’s still a stretch to argue that Twitter is “profiting” from music on its platform.
This is just typical National Music Publishers Association (NMPA) nonsense, in which they falsely insist that no one does anything for any reason except to seek out their music, and that they should be paid for every listen.
Still, there are some things in here that suggest that Musk, in ways that only an incompetent Musk would do, has made his own situation worse. The key bits:
Twitter has repeatedly failed to take the most basic step of expeditiously removing, or disabling access to, the infringing material identified by the infringement notices. Twitter has also continued to assist known repeat infringers with their infringement. Those repeat offenders do not face a realistic threat of Twitter terminating their accounts and thus the cycle of infringement continues across the Twitter platform.
If that’s actually what’s happening, then that would be problematic. The complaint does point to an example of “a known repeat infringer” which at least raises some questions:
The screenshot below illustrates Twitter’s monetization of infringing content. This infringing tweet is from a known repeat infringer who has been the subject of at least nine infringement notices to Twitter, identifying at least fourteen infringing tweets, which contained unauthorized copies of Publishers’ musical compositions. Directly below the infringing tweet is a paid “Promoted” tweet selected by Twitter. To the right of the infringing tweet is a paid “Promoted” account recommended by Twitter. Twitter’s account recommendations also include another known repeat infringer, Twitter Account A, identified in paragraph 166 below.
I’m at least a little confused by this. From what I see there, it’s not at all clear that the original tweet is hosted audio. It’s possible, but normally when there’s a video player it shows with the indicators of a video player. And, honestly, the fact that there are other promoted tweets or recommendations is mostly meaningless for the copyright issues at play.
As for the repeat infringer question, the DMCA requires that companies have a “reasonably implemented” repeat infringer policy, but does not specify exactly how it works, so just claiming that there are repeat infringers on the site, without more info, does not prove that Twitter would be liable for infringement (it could be, I’m just noting that the complaint is pretty weak on this point). The legal battles around this are always about whether or not a particular policy is reasonably implemented, and without more info it’s difficult to know if Twitter’s would be.
Later in the lawsuit there are lots of complaints about how long it takes Twitter to review DMCA takedowns, which might be indicative of a real problem… but might not be:
The precise extent of Twitter’s lengthy delays will be the subject of discovery and analysis, including through a review of Twitter’s records. In the meantime, by way of an example, the musical composition “What a Wonderful World,” written by Bob Thiele and George David Weiss and performed by Louis Armstrong, is a timeless classic, chosen by Rolling Stone in September 2021 as one of the top 200 songs of all time. Unauthorized audio and audio-visual recordings that embody “What a Wonderful World” are rampant on the Twitter platform, and Twitter has failed repeatedly to take them down in an expeditious manner. Across all the NMPA Notices sent to Twitter that identified the musical composition for “What a Wonderful World” by name, along with precise URLs for the tweets containing the infringing uses of that composition, Twitter failed to take down at least 240 infringing tweets incorporating “What a Wonderful World” within 14 days after the NMPA Notice was sent. Even more troubling, over 120 of those tweets were still available at least a month after the associated NMPA notice was sent to Twitter, and more than two dozen tweets were still available on Twitter over two months after NMPA sent a notice identifying them as infringing.
Seems like an odd choice to use, as an example, a song that is literally 56 years old, which at the time it was published had a maximum copyright term of 56 years? Yes, the song is still under copyright thanks to endless copyright term extensions, but… still. You’d think they’d pick another song.
Also, the lawsuit misrepresents Twitter’s marketing claims about Twitter and music, which tend to be about communities of fans, not posting actual music (again, that’s not really a Twitter thing).
Twitter has been outspoken about how important music is to Twitter and users of its platform. In its marketing, blogs, or tweets, Twitter stated:
a. “[M]usic is the largest community” on Twitter’s platform, where “people are more likely to follow a music-related account than any other type of account on Twitter.”
b. The Twitter platform is “the ultimate connection to the music world for fans and brands.”
c. “Every day, more than 30 million tweets are published about music around the world . . . [which is] more than 20,000 every minute.”
Twitter even has its own “@TwitterMusic” account on its platform dedicated to top music trends, which has a massive following of 11.5 million users.
I mean, literally none of that has anything to do with infringing content. It’s mostly about music fans and connecting with artists. Not listening to music on the platform. It’s just designed to sound bad, despite being wholly unrelated to the actual copyright question.
Now, there are some things that Elon has done that may cause him trouble in court. Recently departed trust & safety boss Ella Irwin (stupidly) announced that the company wouldn’t suspend users unless “it is clear the user knew the content was illegal.”
While that may seem commendable in some ways, it might conflict with the DMCA’s requirements regarding repeat infringer policies. At least, the NMPA sure claims it does:
Twitter has told users of its platform that “[w]e don’t suspend users for posting reported content unless it is clear that the user knew the content was illegal.” But Twitter’s practice is unreasonable and contrary to law. Infringement occurs as a matter of law. Direct infringement is a strict liability offense, without any requirement that the infringer know the content they post is illegal.
Except… that’s not entirely accurate by the NMPA either. While the courts have definitely moved in that direction, some still do recognize the concept of innocent infringement (and, frankly, copyright law would be a lot more reasonable if the courts went back to understanding this).
There are other Elon decisions that the complaint calls out, but some are silly and have nothing to do with the copyright questions:
Instead of grounding decisions on sound policy development and reasonable implementation, Twitter has outsourced trust and safety decisions to Twitter polls, i.e., votes among users of the Twitter platform, through a feature on the platform used for polling.
But… there is another thing the lawsuit calls out which MANY copyright lawyers freaked out about last month, when a Twitter user appeared to complain that they were being unfairly hit with copyright claims and Elon told the user to try “turning on subscriptions.”
I saw multiple copyright lawyers freak out about this and try to warn Musk that this tweet would show up in copyright lawsuits. At the time, I looked into the issue and… while it looks bad, it’s not as bad as it seems. The “Figen” account does not appear to actually be infringing on copyrights. It actually is linking to the original uploads by the original users (those might be infringing, but most did appear to be from the original creator of the work). This is a confusing bit of how Twitter works, when you can “repost” someone else’s video, but you’re really just linking to their upload.
Still, this incident shows up in the lawsuit (somewhat obliquely):
By way of another example, a user tweeted that Twitter should not suspend accounts for receiving multiple copyright notices but rather should only disable the copyrighted videos. That user asserted that the user does not earn money from the videos they share, or understand that they are copyrighted, and that copyright owners should ask Twitter users to remove the videos rather than submit notices to Twitter. Twitter replied publicly to this user, but without asking the user not to infringe, without referring the user to Twitter’s Copyright policy, and without telling the user that copyright infringement is unlawful regardless of whether the user makes money from it or realizes that a particular video is infringing. Instead, Twitter suggested that the user “consider turning on subscriptions”—a feature of Twitter Blue that garners revenue for Twitter, enables the user to receive payments from other users of the Twitter platform, and, because the infringing tweets are behind a paywall, makes it more difficult for copyright owners to find.
So, this one goes both ways. If you understand that Figen wasn’t actually infringing, then Elon’s statement isn’t so bad. But it’s not even clear that Elon realized this user wasn’t actually infringing. And if he did believe the account was infringing then… yeah… that’s bad. But, also, it’s not at all a surprise this showed up in a lawsuit.
And then there’s this:
I mean, this is another case where Elon is correct, but that plays badly if you’re in a lawsuit for ignoring DMCA takedowns, and of course the NMPA calls it out.
Twitter’s most senior executive has previously described the Digital Millennium Copyright Act (“DMCA”)—a statute that, among other things, provides for notice and takedown of infringing copyrighted material—as a “plague on humanity.”… This statement and others like it exert pressure on Twitter employees, including those in its trust and safety team, on issues relating to copyright and infringement.
So, anyway, this is not a particularly strong lawsuit, but it’s not a joke either. It’s got many aspects where Elon and his inability to shut the fuck up clearly made things worse. But it does seem like the kind of copyright lawsuit that Twitter could win if it had competent copyright litigators to handle it.
Which means, the question is: can Elon actually hire a competent copyright litigator these days?
One of the (many) villains in “Walled Culture” the book (free ebook versions) is the publishing industry, specifically in the context of the transition from analogue books to ebooks. What could have been one of the most important expansions of the power and possibility of the book form became instead its opposite – a diminishment of both. As a result of publishers’ greed, ebooks became something you rented, rather than owned. Libraries are particularly hard hit: publishers typically only allow the books they license to educational establishments to be lent out for a limited number of times, or for a limited period. Publishers achieved the feat of using the shift to powerful digital technologies to make books less useful, purely in order to boost their profits.
The Walled Culture book explains in detail how the industry was able to do that thanks to bad copyright laws being abused yet further. But there’s a footnote to this transition that I was unaware of when I wrote my history of copyright in the digital age, but which underlines the extent to which most publishers are driven purely by the bottom line, and care little for readers or writers.
It concerns the taxing of books in the UK. Most goods there are subject to a Value Added Tax (VAT), which is a simple percentage of the sale price – generally 20%. However, certain classes of goods are zero-rated, meaning no VAT is charged on them: this applies to things like food, children’s clothing, and also books. Or rather, to physical books: one quirk of the early ebook market was that ebooks were taxed at 20%, even though physical books were not. This led to a 2018 campaign with the catchy slogan “Axe the reading tax”. It was led by the Publishers Association, which wrote in a press release at the time:
Stephen Lotinga, CEO of the Publishers Association, said: “The government must do everything it can to cut the unfair tax on ebooks, magazine and newspaper online subscriptions.
“It makes no sense in the modern world that readers are being penalised with an additional 20% tax for choosing to embrace digital.
“Whether a book, newspaper or magazine is electronic does not change the principle that we should not be taxing reading and learning.”
It was a powerful campaign, backed by just about everyone who cared about books, reading, education and knowledge. It had an extensive Web site Axethereadingtax.org, with lots of very good reasons why the tax should be abolished, such as:
A simpler VAT regime would benefit universities and libraries in terms of freeing up resource and money, as well as students buying educational materials.
And…
Digital formats are vital for the blind and partially sighted, who can listen to audiobooks or read in the largest print sizes on electronic devices, for those with dyslexia and for elderly or disabled people who may lack the physical capabilities to handle print books easily.
Removing the VAT from ebooks and epublications would mean that people who buy them would benefit from lower prices. The impact on the government would be a modest reduction in VAT revenues and is small relative to reduced VAT revenues from other goods and services which are zero-rated, including caravans and hot takeaway food.
The good news is that in 2020, the UK government finally removed the 20% VAT on ebooks. The Publishers Association was rightly triumphant:
We are thrilled that, as of 1 May 2020, the unfair 20% VAT on eBooks and digital newspapers, magazines and journals has been removed. Knowledge and learning are vital, whatever format you favour.
The VAT cut means that ebook publishers could have cut their prices by 17% and made the same profit. They didn’t. Over this period there were 8%+ price reductions for comparable products – computer game and app downloads – where there was no VAT cut. There were no overall price reductions for ebooks.
We also analysed individual pricing data for the 30 best-selling ebooks on Amazon UK in 2020 (as Amazon is by far the most significant ebook retailer). Only four out of thirty showed a sustained price reduction which could plausibly have been attributed to the May 2020 VAT cut. That likely overstates the effect.
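The 17% figure in the quote above follows from simple arithmetic, which a short Python sketch can make explicit. It assumes the 20% rate was applied on top of the pre-tax price; the function name and the £12.00 example price are illustrative and do not come from the report.

```python
from fractions import Fraction

VAT_RATE = Fraction(20, 100)  # the 20% UK VAT rate cited in the article


def price_cut_after_vat_removal(vat_rate: Fraction) -> Fraction:
    """Share of the shelf price that was VAT, i.e. the cut that leaves pre-tax revenue unchanged."""
    # Shelf price = pre-tax price * (1 + rate), so VAT makes up rate / (1 + rate) of it.
    return vat_rate / (1 + vat_rate)


cut = price_cut_after_vat_removal(VAT_RATE)
print(f"Possible price cut: {float(cut):.1%}")

# Hypothetical example: an ebook with a shelf price of £12.00 including VAT.
shelf_price = Fraction(12)
pre_tax_price = shelf_price / (1 + VAT_RATE)
print(f"£{float(shelf_price):.2f} including VAT -> £{float(pre_tax_price):.2f} without VAT")
```

The cut works out to one sixth of the shelf price, roughly 16.7%, which rounds to the 17% cited.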
UK government figures show that dropping VAT on ebooks cost the state £200 million. In theory, that is £200 million that could have flowed to everyone buying ebooks, in the form of lower prices. Here’s where it actually went:
Amazon generally retains a royalty of around 30%, so we can say that of the £200m annual cost of the VAT abolition, Amazon received about £60m and publishers/authors about £140m.
To put these figures in context, the publishing industry’s UK profit in 2021 was probably around £200m. Even after increased author royalty payments, this looks like a very significant enhancement to publisher profitability.
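The split quoted above can be reproduced in a few lines. This is a rough sketch using only the figures in the quote (a £200m annual cost, a retailer share of around 30%, and an estimated £200m industry profit); the variable names are illustrative, and the final percentage is an upper bound because it ignores the increased author royalty payments the quote mentions.

```python
VAT_COST_M = 200          # annual cost of the VAT abolition, in £ millions (from the quote)
RETAILER_SHARE_PCT = 30   # Amazon's approximate share, in percent (from the quote)
INDUSTRY_PROFIT_M = 200   # estimated 2021 UK publishing profit, in £ millions (from the quote)

retailer_gain_m = VAT_COST_M * RETAILER_SHARE_PCT // 100
publisher_author_gain_m = VAT_COST_M - retailer_gain_m

# Upper bound: treats the whole publisher/author share as extra profit.
gain_vs_profit_pct = 100 * publisher_author_gain_m // INDUSTRY_PROFIT_M

print(f"Retailer: £{retailer_gain_m}m")
print(f"Publishers and authors: £{publisher_author_gain_m}m")
print(f"Relative to estimated industry profit: up to {gain_vs_profit_pct}%")
```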
This is a perfect example of how the copyright world operates. It lobbies for changes in the law, claiming that the public is suffering in some way, and exploits the willingness of creators to help put pressure on the government to right that wrong. But when those changes are made, the companies do not pass on the benefits to the public or creators, but keep most of it for themselves.
In the case of axing the reading tax, it was indeed axed – but none of the claimed benefits for universities, or the blind and partially sighted materialized. The publishers kept book prices the same, which means that they picked up an extra 20% on top of an ebook’s pre-tax price (about 17% of the shelf price), since they no longer had to pay VAT. In effect, the tax was still there, but now it simply went to publishers, not the government. All the problems the Publishers Association complained about in terms of the harm to books, reading, learning and education remain. But publishers have become much richer for zero additional work, so suddenly these things don’t matter any more…
Back in August last year, Techdirt covered a major announcement by the US government that all taxpayer-supported research should be immediately available to the public at no cost. As Mike wrote at the time, this is really big, not least for the following key element mentioned in the press release:
This policy guidance will end the current optional embargo that allows scientific publishers to put taxpayer-funded research behind a subscription-based paywall – which may block access for innovators for whom the paywall is a barrier, even barring scientists and their academic institutions from access to their own research findings.
The idea that researchers can’t share or even access their own work might seem absurd – well, it is absurd – but it is also something that happens depressingly often in the academic world, and is one reason why so many people turn to things like Sci-Hub. The new policy addresses this by requiring free and immediate access. However, this only applies to US-funded research, which means that even when it comes into force in 2025, there will still be millions of articles that cannot be accessed and shared freely because they are funded by agencies in other nations.
It would be great if all the funders outside the US could adopt a similar policy requiring immediate free access. The UK is one country that has already taken this approach. In April 2022, the main government funding body in England, UK Research and Innovation (UKRI), made it mandatory for all the results of research that it funds to be made immediately available as open access when they are published in journals.
There are three ways for academics to do this. One is to use Gold open access, where typically an article processing charge is paid by the researcher’s institution to make it freely available immediately. Another is something called “Read & Publish”, a kind of transitional approach, where a publisher receives two bundled payments – a traditional one to publish the article, and another to allow anyone to read it. The final option is Green open access, whereby the researcher’s manuscript is placed in some kind of online repository where it can be downloaded by anyone for free. However, a problem has arisen with this last approach, as the N8 Research Partnership, which represents 12% of all UK academics and 200,000 students there, explains:
in order to achieve this third route to open access researchers need to be able to apply a CC BY license – which allows anyone to make commercial use of the work under the condition of attributing the research in the manner specified by the author or licensor – and place their accepted manuscript in an institutional or other preferred repository. This must now be done without embargo granted to any publisher [under UKRI rules].
However, some publishers are no longer compliant with several not accepting that a researcher’s original rights should be retained by them, meaning that publishers may not accept manuscripts where an application has been made for a CC BY license and the researcher has clearly stated that they own their research.
The key issue is that CC BY may not be an option unless researchers retain the copyright in their articles. It might seem extraordinary, but in addition to providing their work to publishers for free, academics have generally been required to hand over the copyright as well, effectively losing control of their own research results.
The solution to this problem is simple: researchers should retain the copyright in their work, and the N8 Research Partnership now requires its members to enforce this if a publisher refuses to allow a CC BY license. However, the new policy does not make this mandatory for the other ways of publishing – Gold open access or Read & Publish. This is a huge mistake. There is no reason for a researcher to assign their copyright to a publisher – a non-exclusive license is all that the latter requires. By allowing the practice to continue, N8 is implicitly condoning it in some situations. N8 is not alone in this, but the failure by funders to require rights retention as a matter of course is perhaps the biggest obstacle to rolling out full open access around the world today.
A year ago, Techdirt wrote about an important lawsuit in India, brought by the academic publishers Elsevier, Wiley, and the American Chemical Society against Sci-Hub and the similar Libgen. A couple of factors make this particular legal action different from previous attempts to shut down these sites. First, an Indian court ruled in 2016 that photocopying textbooks for educational purposes is fair use; the parallels with Sci-Hub, which provides free access to copies of academic papers for students and researchers who might not otherwise be able to afford the high subscription fees, are clear. Secondly, the person behind Sci-Hub, Alexandra Elbakyan, is fighting the case rather than ignoring it, as she did on previous occasions.
One manifestation of her new pro-active approach is a tweet she posted recently. It included a screenshot of an email she wrote to Nature magazine, which had contacted her about a forthcoming article on the Indian court case. Following standard practice, the journalist writing the article, Holly Else, asked Elbakyan to comment on some of the accusations the academic publishers had made against Sci-Hub. Her responses are fascinating, not least because they provide Elbakyan’s perspective on several important issues.
For example, according to the publishers’ comments as transmitted by Else, “Pirate sites like Sci-Hub threaten the integrity of the scientific record, and the safety of university and personal data”. In reply, Elbakyan points out Sci-Hub is unique, and the use of the phrase “Pirate sites like Sci-Hub” is a clever attempt to lump Sci-Hub in with quite different sites, thus prejudging the legality of its activities. Elbakyan says that it’s academic publishers — not Sci-Hub — which threaten the progress of science:
open communication is [a] fundamental property of science and it makes scientific progress possible. Paywalled access prevents this and is a great threat to science. Also the great threat is also when the whole scientific knowledge became the private property of some corporation such as Elsevier, that has full control of it. That is the threat, not Sci-Hub.
Elbakyan points out that Sci-Hub doesn’t threaten the “integrity of the scientific record”, since she simply disseminates copies of the academic papers without changing them in any way. But perhaps the most interesting part of her reply concerns the accusation that Sci-Hub threatens the safety of university and personal data. Techdirt has written previously about claims that Elbakyan allegedly has links to Russian intelligence, and that Sci-Hub is some kind of security risk. According to Else, the publishers assert:
Pirate sites like Sci-Hub compromise the security of libraries and higher education institutions to gain unauthorized access to scientific databases and other proprietary intellectual property, and illegally harvest journal articles and e-books.
Sci-Hub uses stolen user credentials and phishing attack to extract copyrighted articles illegally
These are serious allegations, and ones that have been made several times in the past. Elbakyan’s response is probably the first time that she has addressed them directly:
Do they have any actual case when Sci-Hub somehow compromised the security of any library or a person? Any person that complained about credentials that were ‘stolen’ from them? Or is it again, nothing more than empty accusations. Nobody is complaining about ‘compromised security’ except academic publishers.
In other words, it is time for Elbakyan’s accusers to put up or shut up. She concludes by stating that “Any law against knowledge is fundamentally unjust”, and hopes that “Nature will have enough honesty to publish my comments in full.”
Techdirt has noted in the past that if public libraries didn’t exist, the copyright industry would never allow them to be created. Publishers can’t go back in time to change history (fortunately). But the COVID pandemic, which largely stopped people borrowing physical books, presented publishers with a huge opportunity to make the lending of newly-popular ebooks by libraries as hard as possible.
A UK campaign to fight that development in the world of academic publishing, called #ebookSOS, spells out the problems. Ebooks are frequently unavailable to institutions to license as ebooks. When they are on offer, they can be ten or more times the cost of the same paper book. The #ebookSOS campaign has put together a spreadsheet listing dozens of named examples. One title cost £29.99 as a physical book, and £1,306.32 for a single-user ebook license. As if those prices weren’t high enough, it’s common for publishers to raise the cost with no warning, and to withdraw ebook licenses already purchased. One of the worst aspects is the following:
Publishers are increasingly offering titles via an etextbook model, via third party companies, licensing content for use by specific, very restricted, cohorts of students on an annual basis. Quotes for these are usually hundreds, or sometimes thousands, times more than a print title, and this must be paid each year for new cohorts of students to gain access. This is exclusionary, restricts interdisciplinary research, and is unsustainable.
Although #ebookSOS is a UK campaign, the problem is global, as publishers try to change the nature of ebook lending everywhere. Ron Wyden and Anna Eshoo have noticed that it’s happening in the US, and seem unimpressed by the publishing industry’s moves, as a letter to the CEO of Penguin Random House (pdf) makes clear:
Many libraries face financial and practical challenges in making e-books available to their patrons, which jeopardizes their ability to fulfill their mission. It is our understanding that these difficulties arise because e-books are typically offered under more expensive and limited licensing agreements, unlike print books that libraries can typically purchase, own, and lend on their own terms. These licensing agreements, with terms set by individual publishers, often include restrictions on lending, transfer, and reproduction, which may conflict with libraries’ ability to loan books, as well as with copyright exceptions and limitations. Under these arrangements, libraries are forced to rent books through very restrictive agreements that look like leases.
The letter asks for answers to nine detailed questions about any restrictions imposed on ebook use, the pricing of both physical and digital books, as well as information about any legal actions that have been taken in response to things like multiple checkouts of digital texts, interlibrary loans, controlled digital lending, and institutions making digital copies of physical books they own.
This is a hugely important battle, since it’s clear the publishing world sees it as a unique chance to redefine what libraries can do with ebooks. It’s part of the much larger, very troubling trend to turn everyone into renters, and to bring about the end of ownership.
A huge and potentially important copyright lawsuit was filed this week by basically all of the big music publishers against the immensely popular kids’ gaming platform Roblox. Although the publishers’ trade association, the NMPA, put out a press release claiming the lawsuit, it doesn’t appear that NMPA is actually a party. The lawsuit is, in many ways, yet another full frontal assault on the DMCA’s safe harbors by the legacy music industry. There’s a lot in this lawsuit and no single article is going to cover it all, but we’ll hit on a few high points.
First, this may seem like a minor point, but I do wonder if it will become important: buried in the massive filing, the publishers mention that Roblox did not have a registered DMCA agent. That seems absolutely shocking, and potentially an astoundingly stupid oversight by Roblox. And there’s at least some evidence that it’s true. Looking now, Roblox does have a registration, but it looks like it was made on… June 9, the day the lawsuit was filed.
Wow. Now, that may seem embarrassing, but it might actually be more embarrassing for the Copyright Office and raise a significant and important legal question. Because it appears that Roblox did at one time have a DMCA agent registration but, as you may recall, back in 2016, the Copyright Office unilaterally decided to throw out all of those registrations and force everyone to renew (and then to renew again every three years through a convoluted and broken process).
There’s an argument to be made that the Copyright Office can’t actually do this. The law itself just says you need to provide the Copyright Office with the information, not that it needs to be renewed. The Copyright Office just made up that part. Perhaps we finally have a test case on our hands to see whether or not the Copyright Office fucked up in dumping everyone’s registration.
Still, that’s a minor point in the larger lawsuit. The publishers throw a lot of theories against the wall, hoping some will stick. It seems like most should be rejected under the DMCA’s safe harbors, because it truly is user generated content, even if the lawsuit tries a variety of approaches to get around that. Part of the lawsuit argues contributory and vicarious copyright infringement, more or less pulling the “inducement” theory from the Grokster ruling, which basically says that if you as a company encourage your users to infringe, you could still be liable (this is, notably, nowhere in the actual law — it’s just what the Supreme Court decided).
But to get there, the lawyers for the music publishers seem to want to take a Roblox executive’s comments completely out of context, in a somewhat astounding manner. The “proof” that Roblox is encouraging people to infringe is here:
Roblox is well aware that its platform is built and thrives on the availability of copyrighted music. As Jon Vlassopulos, Roblox’s global head of music, publicly stated just last year: “We want developers to have great music to build games. We want the music to be, not production music, but really great [commercial] music.” (Alteration in original). To that end, Roblox actively encourages its users to upload audio files containing copyrighted music and incorporate them into game content on the Roblox platform. Roblox advertises the importance of music in games and makes it easy for users to upload, share, and stream full-length songs.
But… if you read the article that they’re using for that Vlassopulos quote, it’s not directed at developers and users of their platform. It’s targeted at musicians and the music industry. The whole point of the quote is to let musicians and the industry know that Roblox is open to licensing deals. It’s pretty obnoxious to try to spin that as encouraging people to infringe when, in context, it sure looks like the exact opposite. I mean, literally the next sentence (which doesn’t make it into the lawsuit) is about how they’re “testing the waters” by making a deal with a small indie label to make all of its music available on Roblox.
So it seems to be Roblox saying the exact opposite of what the publishers are claiming. That’s… kinda fucked up.
The lawsuit also tries to spin the impossible task of trying to moderate as proof that any failures in moderation are deliberate.
There is no question that Roblox has the right and ability to stop or limit the infringement on its platform. But Roblox refuses to do so, so that it can continue to reap huge profits from the availability of unlicensed music. While Roblox touts itself as a platform for “user-generated” content, in reality, it is Roblox – not users – that consciously selects what content appears on its platform. Roblox is highly selective about what content it publishes, employing over a thousand human moderators to extensively pre-screen and review each and every audio file uploaded. Roblox’s intimate review process includes review of every piece of copyrighted music, generally identified by title and artist – to ensure that it meets Roblox’s stringent and detailed content guidelines and community rules. This process ensures that Roblox plays an integral role in monitoring and regulating the online behavior of its young users.
Roblox thus unquestionably exercises substantial influence over its users and the content on its platform, ostensibly in the name of “safety.” Yet Roblox allows a prodigious level of infringing material through its gates, purposely turning a blind eye for the sake of profits. Rather than take responsibility, Roblox absurdly attempts to pass the obligation to its users – many of whom are young children – to represent to Roblox that they own the copyrights to the works they have uploaded.
Coincidentally, just last week we published our content moderation case study on Roblox, focused on how it tries to stop “adult” content on the platform. We noted that the company is very aggressive and hands-on with its moderation efforts but (importantly) it still makes mistakes, because every content moderation system at scale will make mistakes.
So just because Roblox is aggressive in its moderation, and even if it says it reviews everything, that doesn’t mean that it “refuses” to stop infringement. It just means it doesn’t catch it all. Indeed, the company has said in the past that it uses an automated third party monitoring tool to try to catch unauthorized songs (though, notably, this lawsuit is about the publishing rights, not the recording rights, so arguably a monitoring tool might catch some sound recordings while missing other songs that implicate songwriters/publishers — but that’s getting super deep in the weeds).
Indeed, the impossibility of catching everything — while still encouraging websites to try — is why we want things like Section 512 of the DMCA or Section 230 of the CDA. If you suddenly make websites liable for any mistakes they let through, then you create a huge problem. And claiming that their aggressive moderation implicates them even more only encourages sites to do less moderation in the long run.
But, the publishers don’t care about that. Their end goal is clear: as in the EU, they want to force every website to have to buy a blanket license for music. They basically want to do away with the DMCA altogether, then just sit back and collect payments. They want to change the internet almost entirely from a tool for end users to a cash register for music publishers.
There are some other oddities in the lawsuit. It repeatedly tries to claim that Roblox is liable for direct infringement itself, but that theory seems like a stretch. Even the filings admit that the music is all uploaded by users:
Despite Roblox’s written policies, users regularly upload files containing copyrighted music. The act of “uploading” a file to Roblox involves the user making a copy of the file and distributing it to Roblox, where it is then hosted on Roblox’s servers.
To upload an audio file, a user simply opens the Roblox Studio and clicks on a tab marked “Audio,” which then prompts the user to choose a file on their local hard drive, in either .mp3 or .ogg format to be copied and distributed to Roblox’s servers.
It tries to build out the inducement theory by saying that Roblox encourages developers to use music in their games, and that this is the same as encouraging infringement, but that’s nonsense. Nothing in what Roblox says encourages infringement. They’re just saying that sound and music can enhance a game. Which is clearly true.
Roblox makes the process of uploading infringing music extremely easy for users. Roblox even published an article designed to encourage developers to add music to their games, which explains: “While building a game, it’s easy to overlook the importance of sounds and music.” (Emphasis added). That page gives users step-by-step instructions on how to copy and distribute their music files to the Roblox platform.
So what? That’s not telling users to infringe. If anything, it’s saying “find some music you’re able to add to this legally.” You’d think that publishers would be happy about that, as it opens up a new line of business where they could license their music, which is what the Roblox exec was talking about at the beginning. But leave it to the greedy publishers to not want to do the hard work here, and instead try to force a big company into a big payment.
Roblox has already put out a statement saying (not surprisingly) that it’s “surprised and disappointed” by the lawsuit. It seems likely that it will mount an aggressive defense, and it could be yet another important case in seeing whether or not the legacy music industry is able to chip away at another important aspect of the DMCA, and to force all websites that host third party content to buy blanket licenses.
“As a platform powered by a community of creators, we are passionate about protecting intellectual property rights – from independent artists and songwriters, to music labels and publishers – and require all Roblox community members to abide by our Community Rules,” said the statement.
“We do not tolerate copyright infringement, which is why we use industry-leading, advanced filtering technology to detect and prohibit unauthorised recordings. We expeditiously respond to any valid Digital Millennium Copyright Act (DMCA) request by removing any infringing content and, in accordance with our stringent repeat infringer policy, taking action against anyone violating our rules.”
“We are surprised and disappointed by this lawsuit which represents a fundamental misunderstanding of how the Roblox platform operates, and will defend Roblox vigorously as we work to achieve a fair resolution,” continued Roblox’s statement.
Of course, this is par for the course for the legacy industry — especially the publishers as led by the NMPA’s David Israelite. They wait for various internet services to get popular, and then rather than figuring out how that helps them, they sue. It’s how they constantly kill the golden goose. They’ve done it with various internet music services, music games, and more. They’re currently trying to do it with Twitch and now Roblox as well. They overvalue the music component, and choke off the long term business prospects for these platforms, many of which have music as an ancillary add-on.
It’s silly, short-sighted, and anti-culture. In other words, it’s the legacy music industry’s usual playbook.
Techdirt has been following the saga of the City of London Police’s special “Intellectual Property Crime Unit” (PIPCU) since it was formed back in 2013. It has not been an uplifting story. PIPCU seems to regard itself as Hollywood’s private police force worldwide, trying to stop copyright infringement online, but without much understanding of how the Internet works, or even regard for the law, as a post back in 2014 detailed. PIPCU rather dropped off the radar, until last week, when its dire warnings about a new, deadly threat to the wondrous world of copyright were picked up by a number of gullible journalists. PIPCU’s breathless press release reveals the shocking truth: innocent young minds are being encouraged to access knowledge, funded by the public, as widely as possible. Yes, PIPCU has discovered Sci-Hub:
Sci-Hub obtains the papers through a variety of malicious means, such as the use of phishing emails to trick university staff and students into divulging their login credentials. Sci Hub then use this to compromise the university’s network and download the research papers.
That repeats an unsubstantiated claim about Sci-Hub that has frequently been made by academic publishers. And simply using somebody’s login credentials does not constitute “compromising” the university’s network, since at most it gives access to course details and academic papers: believe it or not, students are not generally given unrestricted access to university financial or personnel systems. The press release goes on:
Visitors to the site are very vulnerable to having their credentials stolen, which once obtained, are used by Sci-Hub to access further academic journals for free, and continue to pose a threat to intellectual property rights.
This is complete nonsense. It was obviously written by someone who has never accessed Sci-Hub, since there is no attempt anywhere to ask visitors for any information about anything. The site simply offers friction-free access to 85 million academic papers — and not “70 million” papers as the press release claims, further proof the author never even looked at the site. Even more ridiculous is the following:
With more students now studying from home and having more online lectures, it is vital universities prevent students accessing the stolen information on the university network. This will not only prevent the universities from having their own credentials stolen, but also those of their students, and potentially the credentials of other members of the households, if connected to the same internet provider.
When students are studying from home, they won’t be using the university network if they access Sci-Hub, but their own Internet connection. And again, even if they do visit, they won’t have their credentials “stolen”, because that’s not how the site works. And the idea that members of the same household could also have their “credentials” stolen simply by virtue of being connected to the same Internet provider is so wrong you have to wonder whether the person writing it even knows how the modern (encrypted) Internet works.
But beyond the sheer wrongness of the claims being made here, there’s another, more interesting aspect. Techdirt readers may recall a post from a few months back that analyzed how publishers in the form of the Scholarly Networks Security Initiative were trying to claim that using Sci-Hub was a terrible security risk — rather as PIPCU is now doing, and employing much of the same groundless scare-mongering. It’s almost as if PIPCU, always happy to toe Big Copyright’s line, has uncritically taken a few talking points from the Scholarly Networks Security Initiative and repackaged them in the current sensationalist press release. It would be great to know whether PIPCU and the Scholarly Networks Security Initiative have been talking about Sci-Hub recently. So I’ve submitted a Freedom of Information request to find out.
Last month Techdirt wrote about some ridiculous scaremongering from Elsevier against Sci-Hub, which the publisher claimed was a “security risk”. Sci-Hub, with its 85 million academic papers, is an example of what are sometimes termed “shadow libraries”. For many people around the world, especially in developing countries, such shadow libraries are very often the only way medics, students and academics can access journals whose elevated Western-level subscription prices are simply unaffordable for them. That fact makes a new attack by Elsevier, Wiley and the American Chemical Society against Sci-Hub and the similar Libgen shadow library particularly troubling. The Indian title The Wire has the details:
the publishing giants are demanding that Sci-Hub and Libgen be completely blocked in India through a so-called dynamic injunction. The publishers claim that they own exclusive rights to the manuscripts they have published, and that Sci-Hub and Libgen are engaged in violating various exclusive rights conferred on them under copyright law by providing free access to their copyrighted contents.
Techdirt readers will note the outrageous claim there: that these publishers “own exclusive rights to the manuscripts they have published”. That’s only true in the sense that most publishers force academics to hand over the copyright as a condition of being published. The publishers don’t pay for that copyright, and contribute almost nothing to the final published paper save a little editing and formatting: manuscript review is carried out for free by other academics. And yet the publishers are demanding that Sci-Hub and Libgen should be blocked in India on this basis. Moreover, they want a “dynamic injunction”:
That is, once a defendant’s website is categorised as a “rogue website”, the plaintiff won’t have to go back to the judges to have any new domains blocked for sharing the same materials, and can simply get the injunction order extended with a request to the court’s deputy registrar.
The legal action by publishers against shadow libraries is part of a broader offensive around the world, but there’s a reason why they may face extra challenges in India — over and above the fact that Sci-Hub and Libgen contain huge quantities of material that can unambiguously be shared quite legally. As Techdirt reported back in 2013, a group of Western publishers sued Delhi University over photocopied versions of academic textbooks. For many students in India, this was the only way they could afford such educational materials. In 2016, the Indian court ruled that “copyright is not an inevitable, divine, or natural right”, and that photocopying textbooks is fair use.
The parallels with the new suit against Sci-Hub and Libgen are clear. The latter are digital photocopy sites: they make available copies of educational material to students and researchers who could not otherwise afford access to this knowledge. The copies made by Sci-Hub and Libgen should be seen for what they are: fair use of material that was in any case largely created using public funds for the betterment of humanity, not to boost the bottom line of publishers with profit margins of 35-40%.