Glyn Moody's Techdirt Profile

Posted on Techdirt - 29 April 2026 @ 03:43pm

Leading Cancer Charity Stops Funding Open Access Publishing Because It’s Just Not Working

As numerous posts on this blog have emphasised, the underlying idea of open access (OA) – allowing anyone to read and share published academic research for free – is great in principle, but in practice has failed in important ways. That’s because traditional academic publishers have subverted the open access model to such an extent that the costs for research institutions of publishing in OA journals have barely changed at all. And yet one of the other key aims of open access was to save money while widening availability. Against that background, a natural question to ask is: if open access has failed to deliver savings, why bother supporting it? Cancer Research UK, the world’s leading cancer charity, has evidently asked itself that question and come up with an answer, which it explains in a post entitled “Why we won’t be funding open access publishing any more”:

We need efficient scholarly communications to spread scientific ideas via a fair economic model. We currently don’t have that. The open access movement was bold and promising, but ultimately disappointing. Now is the time to stop and call for a new way to make publishing work…

Ceasing to fund open access in the way we currently do will save us £5.2m of donors’ money over the next three years. That’s a substantial amount which can be put towards cancer research.

The post by Dan Burkwood, Director of Research Operations and Communications at Cancer Research UK, explains what exactly the problem is:

We currently fund open access publishing for our researchers in a number of ways. Despite hopes that this would enable a flourishing of open access dissemination of science, most of the growth has occurred in hybrid journals. These are publications that combine OA articles with those behind a paywall – this means the publishers will still charge for university and institute libraries to access them, even though researchers have paid for their work to be published. For us, this means we currently use donated money to fund our researchers, institutes and centres to publish OA research articles, yet they still have to pay to access the majority of journals in which those articles appear. The publishers are – so to speak – having their cake whilst also eating it.

These so-called “hybrid models” are discussed at length in Chapter 3 of Walled Culture the book (free digital versions available). They were presented as a transitional approach towards journals that were fully open access, but in many cases that transition hasn’t happened, not least because the hybrid model is so profitable for publishers, who therefore have little incentive to move to fully open access titles. Burkwood rightly points to a key reason why academic publishers continue to wield such power: the academic world’s insistence on using published articles in prestigious titles as a metric of success.

Cancer Research UK are working to widen the way we evaluate research in order to mitigate the heavy focus on publication outputs. It’s clear to us that a broader view of an applicant’s career is vital to gauge potential success. By signing up to DORA (San Francisco Declaration on Research Assessment), we encourage our reviewers to assess the quality and impact of research through means other than just journal impact factor. Additionally, we invite applicants to submit a narrative CV, allowing a more holistic view of their track record, research outputs and career progression.

But as he acknowledges, “Despite our, and others, attempts to limit the emphasis of the ‘publish-or-perish’ mindset, it will take time for the culture to change.” In the meantime, he suggests:

If researchers have no access to publishing funds they can publish their work for open access at no cost, but the publication will sit behind a paywall for 6 months (under embargo) before being deposited on Europe PMC open access – this is known as green open access.

Green open access provides full and free access to papers, but only after an embargo period – typically six months, sometimes longer. (Gold open access provides instant access, but requires payment by researchers’ institutions.) That makes green OA a poor substitute for real, immediate open access.

The problem here is that such embargo periods have long been accepted as the norm, but that is only because a terrible blunder was made over two decades ago by the Research Councils UK (RCUK). In 2005, the RCUK stipulated that the work it funded would require open access publication. However, when the final version of the RCUK’s policy appeared in June 2006, it had a significant flaw, expressed in the following provision: ‘Full implementation of these requirements must be undertaken such that current copyright and licensing policies, for example embargo periods or provisions limiting the use of deposited content to non-commercial purposes, are respected by authors.’ As the leading open access scholar Peter Suber wrote at the time, this was a completely unnecessary concession:

Researchers sign funding contracts with the research councils long before they sign copyright transfer agreements with publishers. Funders have a right to dictate terms, such as mandated open access, precisely because they are upstream from publishers. If one condition of the funding contract is that the grantee will deposit the peer-reviewed version of any resulting publication in an open-access repository [immediately], then publishers have no right to intervene.

At the root of the issue of embargoes lies copyright. If researchers retained full control of the copyright of their articles, rather than assigning it to publishers, they could prevent any embargoes being applied to them.

Cancer Research UK’s decision is regrettable but understandable. The fear has to be that others will follow suit. While the hybrid model is not universal, it is widespread enough to undermine the open access idea. Until researchers refuse to publish in such hybrid titles, publishers will continue to profit from them. Given the unnecessary embargoes imposed on articles released under green open access, that leaves alternatives such as diamond open access, where there are no charges for anyone, an approach that has long been espoused on this blog.

Follow me @glynmoody on Mastodon and on Bluesky. Originally posted to Walled Culture.

Posted on Techdirt - 27 April 2026 @ 03:19pm

The Risks Of Anonymity In The Age Of Generative AI

As its name suggests, generative AI is designed to generate material in response to prompts, drawing on the probabilistic model it has built up by analyzing huge quantities of training input. But it can also use those patterns to analyze other files, and that is a widely used application. Writing in The Argument, Kelsey Piper described an interesting variant of that approach:

Recently, Anthropic released a new version of Claude, Opus 4.7. I did what I usually do when a new AI model is released by Google, OpenAI, or Anthropic and ran a bunch of tests on it to see what it can do. One of those tests is to paste in some text from unpublished drafts of mine and ask it to guess the author.

From only the above text [not shown here], 125 words, Claude Opus 4.7 informed me that the likeliest author is Kelsey Piper. This is an Opus 4.7-specific power; ChatGPT guessed Yglesias, and Gemini guessed Scott Alexander. I did not have memory enabled, nor did I have information about me associated with my account; I did these tests in Incognito Mode.

As Piper admits:

this is far from an impossible feat of style identification — a lot of my writing is public on the internet, and this is clearly the start of a political column, narrowing the possible authors down dramatically.

She went on to input less obvious material. For example, an “unpublished draft of a school progress report in a completely different register”:

“Kelsey Piper,” said Claude. (ChatGPT guessed Freddie deBoer. Gemini guessed Duncan Sabien.)

An unpublished fantasy novel produced a similar result, although:

in that case it took more like 500 words for Claude to inform me that it’s the work of Kelsey Piper (whereas ChatGPT flattered me by guessing that I’m real fantasy novelist K.J. Parker).

And finally, “a college application essay I wrote 15 years ago, when my prose style was vastly worse and frankly embarrassing to reread”:

“Kelsey Piper,” said Claude, and in this case, also ChatGPT.

Piper comments:

Right now, today’s AI tools probably can be used to deanonymize any writer who has a large public corpus of writing under their real name and also writes anonymously, unless they have been extremely careful, for years, to make sure that nothing written under their secondary account has the stylistic fingerprints of their primary one. Many academics and industry researchers, for instance, have reported being identified from a draft or in the middle of a chat.

And she concludes:

Whatever goods anonymity ever offered us, we will have to do without them. I don’t want the anonymous posters to all go away and for everyone to frantically delete all their old internet presence before it surfaces, but more than anything, I don’t want them to be surprised.

Those links to other cases of unpublished material being recognized by AI show that Piper’s experience was not a one-off, although the results remain in the realm of anecdata. But even if imperfect, the ability of generative AI to carry out this kind of analysis quickly and often accurately represents an important new option for the well-established field of stylometry. Wikipedia explains:

Stylometry may be used to unmask pseudonymous or anonymous authors, or to reveal some information about the author short of a full identification. Authors may use adversarial stylometry to resist this identification by eliminating their own stylistic characteristics without changing the meaningful content of their communications. It can defeat analyses that do not account for its possibility, but the ultimate effectiveness of stylometry in an adversarial environment is uncertain: stylometric identification may not be reliable, but nor can non-identification be guaranteed; adversarial stylometry’s practice itself may be detectable.
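Piper’s tests used LLMs, but the classical stylometric baseline they improve on is easy to sketch. One traditional technique compares the relative frequencies of common “function words”, which carry little topical meaning yet vary between authors. The word list and cosine-similarity measure below are illustrative choices, not the method of any tool mentioned here:

```python
from collections import Counter
import math

# A tiny set of "function words" -- style markers that carry little
# topical meaning but vary between authors (illustrative subset only).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "was", "it", "for", "on", "with", "as", "but"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two frequency profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def likeliest_author(unknown: str, corpus: dict[str, str]) -> str:
    """Attribute an unknown sample to the known author whose
    function-word profile is most similar."""
    target = profile(unknown)
    return max(corpus, key=lambda name: cosine(profile(corpus[name]), target))
```

Given a corpus of known-author texts, this simply attributes an unknown sample to whichever author’s function-word profile is closest – far cruder than whatever Claude is doing internally, but the same underlying idea.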

The limitations of stylometry were demonstrated in John Carreyrou’s attempt to reveal the true identity of Bitcoin’s pseudonymous creator, Satoshi Nakamoto, published in The New York Times a few weeks ago. Carreyrou concluded that various real-world coincidences plus linguistic evidence indicated that Bitcoin was created by the 55-year-old British computer scientist Adam Back, something Back denies. Carreyrou’s attempts to use computerized stylometry (not the AI services Piper drew on) were unsatisfactory, and he eventually adopted a more hands-on approach to text analysis, looking at Satoshi’s vocabulary, grammatical and hyphenation mistakes, and use of British spellings.

Despite Carreyrou’s lack of success, stylometric analysis by generative AI is likely to become more common in many disciplines for the simple reason it is so quick, easy and cheap to carry out. Even if its results are unreliable, people may find it useful as a stimulus for further investigations. And as we know, the fact that generative AI systems can churn out nonsense hasn’t stopped hundreds of millions of people from using and trusting them anyway.

Follow me @glynmoody on Mastodon and on Bluesky.

Posted on Techdirt - 3 April 2026 @ 01:08pm

Can Agentic AI Coding Tools Finally End Copyright For Software While Re-Inventing Open Source?

Most of the discussions about the impact of the latest generative AI systems on copyright have centered on text, images and video. That’s no surprise, since writers, artists and film-makers feel very strongly about their creations, and members of the public can relate easily to the issues that AI raises for this kind of creativity. But there’s another creative domain that has been massively affected by genAI: software engineering. More and more professional coders are using generative AI to write major elements of their projects for them. Some top engineers even claim that they have stopped coding completely, and now act more as managers directing the AI’s generation of code, because the available tools are so powerful. This applies in the world of open source software too. But a recent incident shows that it raises some interesting copyright issues there that are likely to affect the entire software world.

It concerns a project called chardet, “a universal character encoding detector for Python. It analyzes byte strings and returns the detected encoding, confidence score, and language.” A long and detailed post on Ars Technica explains what has happened recently:

The [chardet] repository was originally written by coder Mark Pilgrim in 2006 and released under an LGPL license that placed strict limits on how it could be reused and redistributed.

Dan Blanchard took over maintenance of the repository in 2012 but waded into some controversy with the release of version 7.0 of chardet last week. Blanchard described that overhaul as “a ground-up, MIT-licensed rewrite” of the entire library built with the help of Claude Code to be “much faster and more accurate” than what came before.

Licensing lies at the heart of open source. When Richard Stallman invented the concept of free software, he did so using a new kind of software license, the GPL. This allows anyone to use and modify software released under the GPL, provided that any modified versions they distribute are released under the same license. As the above description makes clear, chardet was originally released under the LGPL – one of the GPL variants – but version 7.0 is licensed under the much more permissive MIT license. According to Ars Technica:

Blanchard says he was able to accomplish this “AI clean room” process by first specifying an architecture in a design document and writing out some requirements to Claude Code. After that, Blanchard “started in an empty repository with no access to the old source tree and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code.”

That is, generative AI would appear to allow open source licenses like the GPL to be circumvented by rewriting the code without copying anything directly from the original. That’s possible because AI is now so good at coding that the results can be better than the original, as Blanchard proved with version 7.0 of chardet. And because it is new code, it can be released under any license. In fact, it is quite possible that code produced by genAI is not covered by copyright at all, for the same reason that artistic output created solely by AI can’t be copyrighted. If the license can be changed or simply cancelled in this way, then there is no way to force people to release their own variants only under the GPL, as Stallman intended. Similarly, the incentive for people to contribute their own improvements to the main version is diminished.

The ramifications extend even further. This kind of “AI clean room” implementation could be used to make new versions of any proprietary software. That’s been possible for decades – Stallman’s 1983 GNU project is itself a clean-room version of Unix – but generally requires many skilled coders working for long periods to achieve. The arrival of highly capable genAI coding tools has brought down the cost by many orders of magnitude, which means it is relatively inexpensive and quick to produce new versions of any software.

In effect, generative AI coding systems make copyright irrelevant for software, both open source and proprietary. That’s because what is important about computer code is not the details of how it is written, but what it does. AI systems can be guided to create drop-in replacements for other software that are functionally identical, but with completely different code underneath.

Companies that license their proprietary software will probably still be able to do so by offering support packages plus the promise that they take legal responsibility for their code in a way that AI-generated alternatives don’t: businesses would pay for a promise of reliability plus the ability to sue someone when things go wrong. But for the open source world those considerations are not relevant. As a result, the latest progress in AI coding seems a serious threat to the underlying development model that has worked well for the last 40 years, and which underpins most software in use today. But a wise post by Salvatore “antirez” Sanfilippo sees opportunities too:

AI can unlock a lot of good things in the field of open source software. Many passionate individuals write open source because they hate their day job, and want to make something they love, or they write open source because they want to be part of something bigger than economic interests. A lot of open source software is either written in the free time, or with severe constraints on the amount of people that are allocated for the project, or – even worse – with limiting conditions imposed by the companies paying for the developments. Now that code is every day less important than ideas, open source can be strongly accelerated by AI. The four hours allocated over the weekend will bring 10x the fruits, in the right hands (AI coding is not for everybody, as good coding and design is not for everybody).

Perhaps a new kind of open source will emerge – Open Source 2.0 – one in which people do not contribute their software patches to a project, as they do today, but instead send their prompts that produce better versions. People might start working directly on the prompts, collaborating on ways to fine tune them. It’s open source hacking but functioning at a level above the code itself.

One possibility is that such an approach could go some way to solving the so-called “Nebraska problem”: the fact that key parts of modern digital infrastructure are underpinned by “a project some random person in Nebraska has been thanklessly maintaining since 2003”. That person may not receive many more thanks than they have in the past, but with AI assistants constantly checking, rewriting and improving the code, at least the selfless dedication to their project becomes a little less onerous, and thus a little less likely to lead to programmer burnout.

Follow me @glynmoody on Mastodon and on Bluesky. Originally published to Walled Culture.

Posted on Techdirt - 1 April 2026 @ 11:04am

Copyright Industry Continues Its Efforts To Ban VPNs

Last month Walled Culture wrote about an important case at the Court of Justice of the European Union (CJEU), the EU’s top court, that could determine how VPNs can be used in that region. Clarification in this area is particularly important because VPNs are currently under attack in various ways. For example, last year, the Danish government published draft legislation that many believed would make it illegal to use a VPN to access geoblocked streaming content or bypass blocks on illegal websites. In the wake of a firestorm of criticism, Denmark’s Minister of Culture assured people that VPNs would not be banned. However, even though references to VPNs were removed from the text, the provisions are so broadly drafted that VPNs may well be affected anyway. Companies too are taking aim at VPNs. Leading the charge are those in France, which have been targeting VPN providers for over a year now. As TorrentFreak reported last February:

Canal+ and the football league LFP have requested court orders to compel NordVPN, ExpressVPN, ProtonVPN, and others to block access to pirate sites and services. The move follows similar orders obtained last year against DNS resolvers.

The VPN Trust Initiative (VTI) responded with a press release opposing what it called a “Misguided Legal Effort to Extend Website Blocking to VPNs”. It warned:

Such blocking can have sweeping consequences that might put the security and privacy of French citizens at risk.

Targeting VPNs opens the door to a dangerous censorship precedent, risking overreach into broader areas of content.

Indeed: if VPN blocks become an option, there will inevitably be more calls to use them for a wider range of material. The VTI also noted that some of its members are considering whether to abandon the French market completely. That could mean people start using less reliable VPN providers, some of which have dubious records when it comes to security and privacy. The incentive for VPNs to pull out of France is increasing. In August last year the Paris Judicial Court ordered top VPN service providers to block more sports streaming domains, and at the beginning of this year, yet more blocking orders were issued to VPNs operating in France. To its credit, one of the VPN providers affected, ProtonVPN, fought back. As reported here by TorrentFreak, the company tried multiple angles:

The VPN provider raised jurisdictional questions and also requested to see evidence that Canal+ owned all the rights at play. However, these concerns didn’t convince the court.

The same applies to Proton’s net neutrality defense, which argued that Article 333-10 of the French sports code, which is at the basis of all blocking orders, violates EU Open Internet Regulation. This defense was too vague, the court concluded, noting that Proton cited the regulation without specifying which provisions were actually breached.

ProtonVPN also argued that forcing a Swiss company to block sites for the French market is a restriction of cross-border trade in services, and that in any case, the blocking measures were “technically unrealizable, costly, and unnecessarily complex.” Despite this valiant defense, the court was unimpressed. At least ProtonVPN was allowed to contest the French court’s ruling. In a similar case in Spain, no such option was given. According to TorrentFreak:

The court orders were issued inaudita parte, which is Latin for “without hearing the other side.” Citing urgency, the Córdoba court did not give NordVPN and ProtonVPN the opportunity to contest the measures before they were granted.

Without a defense, the court reportedly concluded that both NordVPN and ProtonVPN actively advertise their ability to bypass geo-restrictions, citing match schedules in their marketing materials. The VPNs are therefore seen as active participants in the piracy chain rather than passive conduits, according to local media reports.

That’s pretty shocking, and shows once more how biased in favor of the copyright industry the law has become in some jurisdictions: other parties aren’t even allowed to present a defense. It’s a further reason why a definitive ruling from the CJEU on the right of people to use VPNs how they wish is so important.

Alongside these recent court cases, there is also another imminent attack on the use of VPNs, albeit in a slightly different way. The UK government has announced wide-ranging plans that aim to “keep children safe online”. One of the ideas the government is proposing is “to age restrict or limit children’s VPN use where it undermines safety protections and changing the age of digital consent.” Although this is presented as a child protection measure, the effects will be much wider. The only way to bring in age restrictions for children is if all adult users of VPNs verify their own age. This inevitably leads to the creation of huge new online databases of personal information that are vulnerable to attack. As a side effect, the UK government’s misguided plans will also bolster the growing attempts by the copyright industry to demonize VPNs – a core element of the Internet’s plumbing – as unnecessary tools that are only used to break the law.

Follow me @glynmoody on Mastodon and on Bluesky. Originally published on WalledCulture.

Posted on Techdirt - 24 March 2026 @ 03:37pm

An Open Training Set For AI Goes Global

As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs) massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work out in legal terms is not yet clear. Although a few court cases involving the use of copyright material for training have been decided, many have not, and the detailed contours of the legal landscape remain uncertain.

However, there is an alternative to this “grab it all” approach. It involves using materials that are either in the public domain or released under a “permissive” license that allows LLMs to be trained on them without any problems. There’s plenty of such material online, but its scattered nature puts it at a serious disadvantage compared to downloading everything without worrying about licensing issues. To address that, the Common Corpus was created and released just over a year ago by the French startup Pleias. A press release from the AI Alliance explains the key characteristics of the Common Corpus:

Truly Open: contains only data that is permissively licensed and provenance is documented

Multilingual: mostly representing English and French data, but contains at least 1[billion] tokens for over 30 languages

Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers

Extensively Curated: spelling and formatting has been corrected from digitized texts, harmful and toxic content has been removed, and content with low educational content has also been removed.

There are five main categories of material: OpenGovernment, OpenCulture, OpenScience, OpenWeb, and OpenSource:

OpenGovernment contains Finance Commons, a dataset of financial documents from a range of governmental and regulatory bodies. Finance Commons is a multimodal dataset, including both text and PDF corpora. OpenGovernment also contains Legal Commons, a dataset of legal and administrative texts. OpenCulture contains cultural heritage data like books and newspapers. Many of these texts come from the 18th and 19th centuries, or even earlier.

OpenScience data primarily comes from publicly available academic and scientific publications, which are most often released as PDFs. OpenWeb contains datasets from YouTube Commons, a dataset of transcripts from public domain YouTube videos, and websites like Stack Exchange. Finally, OpenSource comprises code collected from GitHub repositories which were permissibly licensed.

The initial release contained over 2 trillion tokens – the usual way of measuring the volume of training material, where tokens can be whole words and parts of words. A significant recent update of the corpus has taken that to over 2.267 trillion tokens. Just as important as the greater size is the wider reach: there are major additions of material from China, Japan, Korea, Brazil, India, Africa and South-East Asia. Specifically, the latest release contains data for eight languages with more than 10 billion tokens (English, French, German, Spanish, Italian, Polish, Greek, Latin) and 33 languages with more than 1 billion tokens. Because of the way the dataset has been selected and curated, it is possible to train LLMs on fully open data, which leads to auditable models. Moreover, as the original press release explains:
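To make the token measure concrete: a tokenizer splits text into whole words and word fragments, so one word can cost several tokens. The toy greedy longest-match tokenizer below illustrates the idea over an invented vocabulary – real LLM tokenizers such as BPE learn their vocabularies from training data, and this word list is purely for illustration:

```python
# Hypothetical subword vocabulary (illustrative only; real tokenizers
# such as BPE learn their vocabularies from large text corpora).
VOCAB = ["token", "iz", "ation", "multi", "lingual", "corpus"]

def tokenize(word: str, vocab: list[str]) -> list[str]:
    """Greedily split a word into the longest matching subword pieces,
    falling back to single characters when nothing in the vocabulary fits."""
    tokens = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i.
        match = max((v for v in vocab if word.startswith(v, i)),
                    key=len, default=None)
        if match is None:
            tokens.append(word[i])  # unknown character: emit it alone
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens
```

Here `tokenize("tokenization", VOCAB)` yields three tokens, `["token", "iz", "ation"]`, which is why a 2-trillion-token corpus contains considerably fewer than 2 trillion words.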

By providing clear provenance and using permissibly licensed data, Common Corpus exceeds the requirements of even the strictest regulations on AI training data, such as the EU AI Act. Pleias has also taken extensive steps to ensure GDPR compliance, by developing custom procedures to enable personally identifiable information (PII) removal for multilingual data. This makes Common Corpus an ideal foundation for secure, enterprise-grade models. Models trained on Common Corpus will be resilient to an increasingly regulated industry.

Another advantage for many users is that material with high “toxicity scores” has already been removed, thus ensuring that any LLMs trained on the Common Corpus will have fewer problems in this regard.

The Common Corpus is a great demonstration of the power of openness and permissive copyright licensing, and how they bring benefits that other approaches can’t match. For example: “Common Corpus makes it possible to train models compatible with the Open Source Initiative’s definition of open-source AI, which includes openness of use, meaning use is permitted for ‘any purpose and without having to ask for permission’. ” That fact, along with the multilingual nature of the Common Corpus, would make the latest version a great fit for any EU move to create “public AI” systems, something advocated on this blog a few months back. The French government is already backing the project, as are other organizations supporting openness:

The Corpus was built up with the support and concerted efforts of the AI Alliance, the French Ministry of Culture as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).

This dataset was also made in partnership with Wikimedia Enterprise and Wikidata/Wikimedia Germany. We’re also thankful to our partner Libraries Without Borders for continuous assistance on extending low resource language support.

The corpus was stored and processed with the generous support of the AI Alliance, Jean Zay (Eviden, Idris), Tracto AI, Mozilla.

The unique advantages of the Common Corpus mean that more governments should be supporting it as an alternative to proprietary systems, which generally remain black boxes in terms of where their training data comes from. Publishers, too, would be wise to fund it, since it offers a powerful resource explicitly designed to avoid some of the thorniest copyright issues plaguing the generative AI field today.

Follow me @glynmoody on Mastodon and on Bluesky. Originally published to Walled Culture.

Posted on Techdirt - 13 March 2026 @ 01:07pm

Roblox Rolls Out AI-Powered Real-Time Rephrasing Of Profanity Within Chat

The power of the latest generation of AI systems is such that previously impractical applications are not just possible, but scalable. For example, moving beyond basic early AI text translation tools, it is now possible to use live translation to communicate in another language in real time. For many people that will be a real boon, especially when they are traveling. But here’s something that is likely to prove more controversial: real-time rephrasing of profanity within chat. It’s a new AI-powered feature from Roblox that is designed to “keep gameplay fluid while maintaining civility within chat”:

Roblox is leveraging AI to automatically rephrase profanity. Rather than displaying only hashmarks, filtered text will be translated into more respectful language that remains closer to the user’s original intent. For example, a message that violates Roblox’s profanity policies, such as “Hurry TF up!” would previously have appeared as “####” within experience chat. That will now be rephrased to “Hurry up!” This new layer is designed to maintain civility by rephrasing the language and replacing “stop signs” with real-time guidance.

Specifically:

When a message violates Roblox’s profanity policy, everyone in the chat is notified that the text has been rephrased to keep things civil. While rephrasing reduces some of the disruption in the chat, Roblox’s multilayered safety system remains in effect for more serious behavior. Rephrasing is available exclusively for in-experience chat between age-checked users in similar age groups and is supported in all languages currently available through Roblox’s automatic translation tools.

Alongside this new AI-based capability, Roblox is also tweaking its text filtering system:

Early results from Roblox’s testing show significant improvements in detecting leet-speak, or letters replaced with numbers or symbols, and more sophisticated attempts to bypass filters.
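Roblox has not published how its filter works, but a common approach to catching leet-speak is to normalize substituted characters back to letters before matching against a blocklist. The sketch below is purely illustrative: the substitution map, the `normalize` and `contains_blocked` helpers, and the placeholder blocklist are all assumptions, not Roblox's actual system.

```python
# Hypothetical sketch of leet-speak normalization for a chat filter.
# All names and mappings here are illustrative assumptions, not Roblox's code.

LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "8": "b", "@": "a", "$": "s",
})

BLOCKLIST = {"damn"}  # placeholder term for illustration


def normalize(text: str) -> str:
    """Lowercase the text and undo common character substitutions."""
    return text.lower().translate(LEET_MAP)


def contains_blocked(text: str) -> bool:
    """Check each normalized word against the blocklist."""
    words = normalize(text).split()
    return any(w.strip(".,!?") in BLOCKLIST for w in words)


print(contains_blocked("D4MN that was close"))  # True
print(contains_blocked("that was close"))       # False
```

Real systems are far more sophisticated, combining normalization like this with machine-learned classifiers that catch substitutions no fixed table anticipates, which is presumably where the AI improvements Roblox reports come in.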

Parents may applaud real-time rephrasing as a way for the service to nudge younger users away from bad language in their interactions with others, without stopping them from playing altogether. But it creates a dangerous proof of concept that others may build on, particularly in jurisdictions that want stricter controls on what people say online.

It’s easy to imagine situations where Chinese AI systems, for example, rephrase people’s language on social media in real time to promote “social harmony”. Not only the style but even the content’s details could be subtly changed away from controversy towards conformity. It would be possible for rephrasing to be visible only to others, so the person making a comment might not even be aware that their words were being subverted in this way. Something similar is already happening with Chinese AI chatbots that censor their own answers, without acknowledging that fact. As Chinese AI companies become increasingly important players in the online world, this kind of covert rephrasing by them — and others — is another issue people will need to watch out for in our brave new AI world.

Follow me @glynmoody on Bluesky and on Mastodon.

Posted on Techdirt - 23 February 2026 @ 03:08pm

How Copyright Litigation Over Anne Frank’s Diary Could Impact The Fate Of VPNs In The EU

“The Diary of a Young Girl” is a Dutch-language diary written by the young Jewish writer Anne Frank while she was in hiding for two years with her family during the Nazi occupation of the Netherlands. Although the diary and Anne Frank’s death in the Bergen-Belsen concentration camp are well known, few are aware that the text has a complicated copyright history – one that could have important implications for the legal status and use of Virtual Private Networks (VPNs) in the EU. TorrentFreak explains the copyright background:

These copyrights are controlled by the Swiss-based Anne Frank Fonds, which was the sole heir of Anne’s father, Otto Frank. The Fonds states that many print versions of the diary remain protected for decades, and even the manuscripts are not freely available everywhere.

In the Netherlands, for example, certain sections of the manuscripts remain protected by copyright until 2037, even though they have entered the public domain in neighboring countries like Belgium.

A separate foundation, the Netherlands-based Anne Frank Stichting, wanted to publish a scholarly edition of Anne Frank’s writing, at least in those parts of the world where her diary was in the public domain:

To navigate these conflicting laws, the Dutch Anne Frank Stichting published a scholarly edition online using “state-of-the-art” geo-blocking to prevent Dutch residents from accessing the site. Visitors from the Netherlands and other countries where the work is protected are met with a clear message, informing them about these access restrictions.
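Geo-blocking of the kind described above typically works by resolving each visitor’s IP address to a country and refusing access from listed territories. The sketch below is a hypothetical illustration only: the stub lookup table and function names are assumptions standing in for a real GeoIP database service.

```python
# Hypothetical sketch of IP-based geo-blocking. A real deployment would
# resolve the visitor's IP against a commercial GeoIP database; the stub
# lookup below stands in for that service.

BLOCKED_COUNTRIES = {"NL"}  # e.g. territories where the work is still in copyright


def country_for_ip(ip: str) -> str:
    """Stub GeoIP lookup; returns an ISO country code or '??' if unknown."""
    fake_db = {"145.90.0.1": "NL", "8.8.8.8": "US"}
    return fake_db.get(ip, "??")


def allow_access(ip: str) -> bool:
    """Allow the request unless the IP resolves to a blocked country."""
    return country_for_ip(ip) not in BLOCKED_COUNTRIES


print(allow_access("8.8.8.8"))     # True
print(allow_access("145.90.0.1"))  # False
```

The sketch also shows why VPNs defeat this mechanism: the site only ever sees the IP address of the VPN’s exit server, so a Dutch visitor routing through, say, a Belgian server looks to the check like a Belgian one.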

However, the Anne Frank Fonds was unhappy with this approach, and took legal action. Its argument was that such geo-blocking could be circumvented with VPNs, and so its copyrights in the Netherlands could be infringed upon by those using VPNs. The lower courts in the Netherlands dismissed this argument, and the case is now before the Dutch Supreme Court. Beyond the specifics of the Anne Frank scholarly edition, there are important issues regarding the use of VPNs to get around geo-blocking. Because of the potential knock-on effect the ruling in this case will have on EU law, the Dutch Supreme Court has asked for guidance from the EU’s top court, the Court of Justice of the European Union (CJEU).

The CJEU has yet to rule on the issues raised. But one of the court’s advisors, Advocate General Rantos, has published a preliminary opinion, as is normal in such cases. Although that advice is not binding on the CJEU, it often provides some indication as to how the court may eventually decide. On the main issue of whether the ability of people to circumvent geo-blocking is a problem, Rantos writes:

the fact that users manage to circumvent a geo-blocking measure put in place to restrict access to a protected work does not, in itself, mean that the entity that put the geo-blocking in place communicates that work to the public in a territory where access to it is supposed to be blocked. Such an interpretation would make it impossible to manage copyright on the internet on a territorial basis and would mean that any communication to the public on the internet would be global.

Moreover:

As the [European] Commission pointed out in its written observations, the holder of an exclusive right in a work does not have the right to authorise or prohibit, on the basis of the right granted to it in one Member State, communication to the public in another Member State in which that right has ceased to have effect.

Or, more succinctly: “service providers in the public domain country cannot be subject to unreasonable requirements”. That’s a good, common-sense view. But perhaps just as important is the following comment by Rantos regarding the use of VPNs to circumvent geo-blocking:

as the Commission points out in its observations, VPN services are legally accessible technical services which users may, however, use for unlawful purposes. The mere fact that those or similar services may be used for such purposes is not sufficient to establish that the service providers themselves communicate the protected work to the public. It would be different if those service providers actively encouraged the unlawful use of their services.

That’s an important point at a time when VPNs are under attack from some governments because of concerns about possible copyright infringement by those using them.

The hope has to be that the CJEU will agree with its Advocate General’s sensible and fair analysis, and will rule accordingly. But there is another important aspect to this story. The basic issue is that the Anne Frank Stichting wants to make its scholarly edition of Anne Frank’s diary available as widely as possible. That seems a laudable aim, since it will increase understanding and appreciation of the young woman’s remarkable diary by publishing an academically rigorous version. And yet the Anne Frank Fonds has taken legal action to stop that move, on the grounds that it would represent an infringement of its intellectual monopoly in some parts of Frank’s work, in some parts of the world. The current dispute is another clear example of how copyright has become for some an end in itself, more important than the things that it is supposed to promote.

Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.

Posted on Techdirt - 19 February 2026 @ 01:30pm

Wikipedia Grapples With New Challenges From AI

Wikipedia celebrated its 25th birthday last month. Given the centrality of Wikipedia to so much activity online, it is hard to remember (or to imagine, for those who are younger) a time without Wikipedia. The latest statistics are impressive:

  • Wikipedia is viewed nearly 15 billion times every month.
  • Wikipedia contains over 65 million articles across more than 300 languages.
  • Wikipedia is edited by nearly 250,000 editors around the world every month. Editors are defined as those making at least one edit per month; only editors with a username are counted.
  • Wikipedia is accessed by over 1.5 billion unique devices every month.

That’s testimony to the global nature of Wikipedia. But there’s something else, not mentioned there, that is of great relevance to this blog: the fact that every one of those 65 million articles is made available under a generous license – the Creative Commons Attribution-ShareAlike 4.0 license, to be precise. That means sharing and re-use are encouraged, in contrast to most material online, where copyright is fiercely enforced. Wikipedia is living proof that giving away things by relying on volunteers and donations – the “true fans” approach – works, and on a massive scale. Anil Dash puts it well in a post celebrating Wikipedia’s 25th anniversary:

Whenever I worry about where the Internet is headed, I remember that this example of the collective generosity and goodness of people still exists. There are so many folks just working away, every day, to make something good and valuable for strangers out there, simply from the goodness of their hearts. They have no way of ever knowing who they’ve helped. But they believe in the simple power of doing a little bit of good using some of the most basic technologies of the internet. Twenty-five years later, all of the evidence has shown that they really have changed the world.

However, Wikipedia is today facing perhaps its greatest challenge, which comes from the new generation of AI services. They are problematic for Wikipedia in two main ways. The first, ironically, is because it is widely recognized that Wikipedia’s holdings represent some of the highest-quality training materials available. In a post explaining why, “in the AI era, Wikipedia has never been more valuable”, the Wikimedia Foundation writes:

AI cannot exist without the human effort that goes into building open and nonprofit information sources like Wikipedia. That’s why Wikipedia is one of the highest-quality datasets in the world for training AI, and when AI developers try to omit it, the resulting answers are significantly less accurate, less diverse, and less verifiable.

That recognition is welcome, but comes at a price. It means that every AI company as a matter of course wants to download the entire Wikipedia corpus to be used for training its models. That has led to irresponsible behavior by some companies, whose scraping tools download pages from Wikipedia with no consideration for the resources they are consuming for free, or for the collateral damage they cause, such as slower responses for other users.

Trying to stop companies drawing on this unique resource is futile; recognizing this, the Wikimedia Foundation has come up with an alternative approach: Wikimedia Enterprise, “a first-of-its-kind commercial product designed for companies that reuse and source Wikipedia and Wikimedia projects at a high volume”. In 2022, its first customers were Google and the Internet Archive, and last month, Wikimedia Enterprise announced that Amazon, Meta, Microsoft, Mistral AI, and Perplexity have also signed up. That’s important for a couple of reasons. It means that many of the biggest AI players will download Wikipedia articles more efficiently. It also means that the Wikipedia project will receive funding for its work.

This new money is crucial if Wikipedia is to remain a high quality resource. And that is precisely why every generative AI company that uses Wikipedia articles for training should – if only out of self-interest – pay to do so. What is happening here echoes something this blog suggested back in May 2024: that AI companies should pay artists to create new works, and give away the results, because fresh training material is vital. Helping to pay for Wikipedia to create more high-quality articles that are freely available to all is a variation on that theme.

The other problem that generative AI causes Wikipedia is more subtle. The Wikimedia Foundation explains that alongside financial support, the project needs proper attribution:

Attribution means that generative AI gives credit to the human contributions that it uses to create its outputs. This maintains a virtuous cycle that continues those human contributions that create the training data that these new technologies rely on. For people to trust information shared on the internet, platforms should make it clear where the information is sourced from and elevate opportunities to visit and participate in those sources. With fewer visits to Wikipedia, fewer volunteers may grow and enrich the content, and fewer individual donors may support this work.

Without fresh volunteers, Wikipedia will wither and become less valuable. That’s terrible for the world, but it is also bad for generative AI companies. So, again, it makes sense for them to provide proper attribution in their outputs. That requirement has become even more pressing in the light of a new development. According to tests carried out by the Guardian:

The latest model of ChatGPT has begun to cite Elon Musk’s Grokipedia as a source on a wide range of queries, including on Iranian conglomerates and Holocaust deniers, raising concerns about misinformation on the platform.

That’s potentially problematic because of how Grokipedia creates its entries. Research last year found that:

Grokipedia articles are substantially longer and contain significantly fewer references per word. Moreover, Grokipedia’s content divides into two distinct groups: one that remains semantically and stylistically aligned with Wikipedia, and another that diverges sharply. Among the dissimilar articles, we observe a systematic rightward shift in the political bias of cited news sources, concentrated primarily in entries related to politics, history, and religion. These findings suggest that AI-generated encyclopedic content diverges from established editorial norms – favouring narrative expansion over citation-based verification.

If leading chatbots start drawing on Grokipedia routinely for their answers, it becomes less likely that independent sources exist where the information can be checked – something generally possible with Wikipedia. It therefore becomes even more urgent for generative AI systems to provide attribution, so at least users know where information is coming from, and whether there are likely to be further resources that confirm a chatbot’s claims. Not every user will want to follow up those sources, but it is important to offer the option.

Wikipedia at 25 is an amazing achievement in multiple ways, one of which is demonstrating that material can be given away for free, supported directly by users, on a global scale. It would be a tragedy if the current enthusiasm for generative AI systems led to that resource being harmed and even destroyed. A world without Wikipedia would be a poorer world indeed.

Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.

Posted on Techdirt - 4 February 2026 @ 01:32pm

OpenAI’s New Scientific Writing And Collaboration Workspace ‘Prism’ Raises Fears Of Vibe-Coded Academic AI Slop

It is no secret that large language models (LLMs) are being used routinely to modify and even write scientific papers. That’s not necessarily a bad thing: LLMs can help produce clearer texts with stronger logic, not least when researchers are writing in a language that is not their mother tongue. More generally, a recent analysis in Nature magazine, reported by Science magazine, found that scientists embracing AI — of any kind — “consistently make the biggest professional strides”:

AI adopters have published three times more papers, received five times more citations, and reach leadership roles faster than their AI-free peers.

But there is also a downside:

Not only is AI-driven work prone to circling the same crowded problems, but it also leads to a less interconnected scientific literature, with fewer studies engaging with and building on one another.

Another issue with LLMs, that of “hallucinated citations,” or “HalluCitations,” is well known. More seriously, entire fake publications can be generated using AI, and sold by so-called “paper mills” to unscrupulous scientists who wish to bolster their publication lists to advance their careers. In the field of biomedical research alone, a recent study estimated that over 100,000 fake papers were published in 2023. Not all of those were generated using AI, but progress in LLMs has made the process of creating fake articles much simpler.

Fake publications generated using LLMs are often obvious because of their lack of sophistication and polish. But a new service from OpenAI, called Prism, is likely to eliminate such easy-to-spot signs, by adding AI support to every aspect of writing a scientific paper:

Prism is a free workspace for scientific writing and collaboration, with GPT‑5.2⁠—our most advanced model for mathematical and scientific reasoning—integrated directly into the workflow.

It brings drafting, revision, collaboration, and preparation for publication into a single, cloud-based, LaTeX-native workspace. Rather than operating as a separate tool alongside the writing process, GPT‑5.2 works within the project itself—with access to the structure of the paper, equations, references, and surrounding context.

It includes a number of features that make creating complex — and fake — papers extremely easy:

  • Search for and incorporate relevant literature (for example, from arXiv) in the context of the current manuscript, and revise text in light of newly identified related work
  • Create, refactor, and reason over equations, citations, and figures, with AI that understands how those elements relate across the paper
  • Turn whiteboard equations or diagrams directly into LaTeX, saving hours of time manipulating graphics pixel-by-pixel

There is even voice-based editing, allowing simple changes to be made without the need to write anything. But scientists are already worried that the power of OpenAI’s Prism will make a deteriorating situation worse. As an article on Ars Technica explains:

[Prism] has drawn immediate skepticism from researchers who fear the tool will accelerate the already overwhelming flood of low-quality papers into scientific journals. The launch coincides with growing alarm among publishers about what many are calling “AI slop” in academic publishing.

One field that is already plagued by AI slop is AI itself. An FT article on the topic points to an interesting attempt by the International Conference on Learning Representations (ICLR), a major gathering of researchers in the world of machine learning, to tackle this problem with punitive measures against authors and reviewers who violate the ICLR’s policies on LLM-generated material. For example:

Papers that make extensive usage of LLMs and do not disclose this usage will be desk rejected [that is, without sending them out for external peer review]. Extensive and/or careless LLM usage often results in false claims, misrepresentations, or hallucinated content, including hallucinated references. As stated in our previous blog post: hallucinations of this kind would be considered a Code of Ethics violation on the part of the paper’s authors. We have been desk-rejecting, and will continue to desk-reject, any paper that includes such issues.

Similarly:

reviewers [of submitted papers] are responsible for the content they post. Therefore, if they use LLMs, they are responsible for any issues in their posted review. Very poor quality reviews that feature false claims, misrepresentations or hallucinated references are also a code of ethics violation as expressed in the previous blog post. As such, reviewers who posted such poor quality reviews will also face consequences, including the desk rejection of their [own] submitted papers.

It is clearly not possible to stop scientists from using AI tools to check and improve their papers, nor is it necessary to do so, provided authors flag up such usage and no errors are introduced as a result. A policy of the kind adopted by the ICLR, requiring transparency about the extent to which AI has been used, seems a sensible approach in the face of increasingly sophisticated tools like OpenAI’s Prism.

Follow me @glynmoody on Bluesky and Mastodon.

Posted on Techdirt - 29 December 2025 @ 03:31pm

How Generative AI Is Enabling More Connections With True Fans

Walled Culture has written a number of times about the true fans approach – the idea that creators can be supported directly and effectively by the people who love their work. As Walled Culture the book explains (available as a free ebook), one of the earliest and best expositions of the concept came from Kevin Kelly, former Executive Editor at Wired magazine, in an essay he wrote originally in 2008. The true fans idea is sometimes dismissed as simply selling branded t-shirts to supporters. That may have been true decades ago, but things have moved on. For example, Universal Music Group has recently opened retail locations that cater specifically for true fans. In addition to shops in Tokyo and Madrid, there are new outlets in New York and London. Here’s what the latter will offer, as reported by Music Business Worldwide:

Located in Camden Market, the London-based space will “serve as a creative hub where music, fashion, and design collide,” UMG said.

The announcement added that the shop was “designed to capture Camden’s rebellious spirit and deep musical roots”.

The store will feature exclusive artist collections, immersive installations, and live performances, along with a Vinyl Lounge, DJ booth, and recording studio-inspired Sound Room that “allows fans to experience music like never before”.

That is a fairly conventional extension of the “selling branded t-shirts to supporters” idea. A post on the Midia Research blog points out a more radical development in the true fans space involving the latest generative AI technology:

AI is best considered as an accelerant rather than something entirely new, intensifying pre-existing trends. AI music absolutely fits this trend. Over the course of the last decade – including a super-charged COVID bump – accessible music tech has enabled ever-more people to become music creators. AI simply lowered the barriers to entry even further. The debate over whether a text prompt constitutes creativity will continue to run (just like the same debate still runs for sampling), but what is clear is that more people are now making music because of AI.

Thanks to genAI, true fans are not limited to a passive role. They can actively participate in the artistic ecosystem brought into being by their musical heroes, through the creation of new works based on and extending the originals they love. The fanfic world has been doing this for many years, so it is no surprise to find the use of generative AI even more advanced there than in the world of music. For example, the DreamGen site lists no fewer than nine “AI fanfic generators”, including its own. It offers a good description of how these systems work:

1. You give it a prompt: This could be something like “Harry Potter and Hermione go on a space adventure” or “Naruto meets Spider-Man in New York.”

2. The AI takes over: It uses its knowledge of language and storytelling to write a story based on your idea. It fills in the details, such as dialogue, action, emotions, and plot twists.

3. You can guide it: Want more romance? More drama? A surprise ending? You can tweak the prompt or add instructions, and the AI will adjust the story.

4. You get a full fanfic: Some tools write it all at once, others let you build it paragraph by paragraph so you can shape the story as it goes.

As that indicates, the new AI-based fanfic generators are simple enough for anyone to use. The only limits are imagination and the ability to put it into words. That’s an incredible democratization of creativity that takes the idea of participatory fandom to the next level. And, of course, it can be applied in other domains too, such as “fan art”, which Wikipedia defines as follows:

Fan art or fanart is artwork created by fans of a work of fiction or celebrity depicting events, characters, or other aspects of the work. As fan labor, fan art refers to artworks that are not created, commissioned, nor endorsed by the creators of the work from which the fan art derives.

As with other uses of genAI, this raises questions of copyright, some of which have already found their way to court. Perhaps surprisingly, Disney has just announced its embrace of this use of AI by fans, in a partnership with OpenAI:

The Walt Disney Company and OpenAI have reached an agreement for Disney to become the first major content licensing partner on Sora, OpenAI’s short-form generative AI video platform, bringing these leaders in creativity and innovation together to unlock new possibilities in imaginative storytelling.

As part of this new, three-year licensing agreement, Sora will be able to generate short, user-prompted social videos that can be viewed and shared by fans, drawing from a set of more than 200 animated, masked and creature characters from Disney, Marvel, Pixar and Star Wars, including costumes, props, vehicles, and iconic environments. In addition, ChatGPT Images will be able to turn a few words by the user into fully generated images in seconds, drawing from the same intellectual property. The agreement does not include any talent likenesses or voices.

There’s a billion-dollar investment by Disney in OpenAI, as well as the following:

OpenAI and Disney will collaborate to utilize OpenAI’s models to power new experiences for Disney+ subscribers, furthering innovative and creative ways to connect with Disney’s stories and characters.

Presumably, Disney hopes to gain more Disney+ subscribers and drive more revenues with these short-form, fan-generated videos, plus whatever “creative ways” of using AI that it comes up with. OpenAI, meanwhile, gains some handy investment, and a showcase for its Sora genAI video platform.

Although this deal is a welcome sign that some major copyright companies are starting to think imaginatively and positively about genAI, and how it can actually boost profits, the new service will doubtless be rather limited, not least in terms of what kind of videos can be generated. The press release emphasises:

OpenAI and Disney have affirmed a shared commitment to maintaining robust controls to prevent the generation of illegal or harmful content, to respect the rights of content owners in relation to the outputs of models, and to respect the rights of individuals to appropriately control the use of their voice and likeness.

That means that there will always be room for edgier, smaller sites producing fanfic, fan art and fan videos that don’t worry about things like good taste or copyright. As more fans discover the delights of building on and extending the creative ideas of their idols in novel ways using genAI, we can expect a corresponding rise in the number of legal actions trying to stop them doing so.

Follow me @glynmoody on Mastodon and on Bluesky. Originally posted to Walled Culture.

More posts from Glyn Moody >>