A Numerical Exploration Of How The EU's Article 13 Will Lead To Massive Censorship
from the it's-not-good-folks dept
One of the key talking points from those in favor of Article 13 in the EU Copyright Directive is that people who claim it will lead to widespread censorship are simply making it up. We’ve explained many times why this is untrue, and how any time you put in place a system for taking down content, tons of perfectly legitimate content gets caught up in it. Some of this is from malicious takedowns, but much of it is just because algorithms make mistakes. And when you make mistakes at scale, bad things happen. Most of you are familiar with the concept of “Type 1” and “Type 2” errors in statistics. These can be more simply described as false positives and false negatives. Over the weekend, Alec Muffett decided to put together a quick “false positive” emulator to show how much of an impact this would have at scale, and tweeted out quite a thread, which has since been un-threaded into a webpage for easier reading. In short, at scale, the “false positive” problem is pretty intense. A ton of non-infringing content is likely to get swept up in the mess.
Using a baseline of 10 million pieces of content, a much higher than reality level of accuracy (99.5%), and an assumption that 1 in 10,000 items are “bad” (i.e., “infringing”), you end up with a ton of legitimate content taken down to stop just a bit of infringement:
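Muffett’s actual emulator isn’t reproduced here, but the expected-value arithmetic behind that scenario can be sketched in a few lines of Python (the function name is my own):

```python
def false_positive_emulator(total, bad_rate, accuracy):
    """Expected filter outcomes for `total` items, where a fraction
    `bad_rate` is infringing and the filter is right `accuracy` of the time."""
    bad = total * bad_rate
    good = total - bad
    false_positives = good * (1 - accuracy)  # legitimate items wrongly blocked
    true_positives = bad * accuracy          # infringing items correctly caught
    return round(false_positives), round(true_positives)

fp, tp = false_positive_emulator(10_000_000, 1 / 10_000, 0.995)
print(f"{fp:,} legitimate items blocked; {tp:,} infringing items caught")
# 49,995 legitimate items blocked; 995 infringing items caught
```

Even at 99.5% accuracy, the 0.5% error rate applied to 9,999,000 legitimate items swamps the 1,000 infringing ones.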
So basically in an effort to stop 1,000 pieces of infringing content, you’d end up pulling down 50,000 pieces of legitimate content. And that’s with an incredible (and unbelievable) 99.5% accuracy rate. Drop the accuracy rate to a still optimistic 90%, and the results are even more stark:
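The same back-of-the-envelope arithmetic at 90% accuracy, sketched in Python:

```python
total = 10_000_000
bad = 1_000                 # 1 in 10,000 items infringing
good = total - bad          # 9,999,000 legitimate items
accuracy = 0.90

false_positives = round(good * (1 - accuracy))  # legitimate items wrongly blocked
true_positives = round(bad * accuracy)          # infringing items correctly caught
print(f"{false_positives:,} legitimate items blocked; {true_positives:,} caught")
# 999,900 legitimate items blocked; 900 caught
```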
Now we’re talking about pulling down one million legitimate, non-infringing pieces of content in pursuit of just 1,000 infringing ones (many of which the system still misses).
Of course, I can hear the howls from the usual crew, complaining that the 1 in 10,0000 number is unrealistic (it’s not). Lots of folks in the legacy copyright industries want to pretend that the only reason people use big platforms like YouTube and Facebook is to upload infringing material, but that’s laughably wrong. It’s actually a very, very small percentage of such content. And, remember, of course, Article 13 will apply to basically any platform that hosts content, even ones that are rarely used for infringement.
But, just to humor those who think infringement is a lot more widespread than it really is, Muffett also ran the emulator with a scenario in which 1 out of every 500 pieces of content are infringing and (a still impossible) 98.5% accuracy. It’s still a disaster:
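Running the same expected-value sketch for that more-infringement-friendly scenario:

```python
total = 10_000_000
bad = total // 500          # 1 in 500 items infringing: 20,000 items
good = total - bad          # 9,980,000 legitimate items
accuracy = 0.985

false_positives = round(good * (1 - accuracy))  # legitimate items wrongly blocked
true_positives = round(bad * accuracy)          # infringing items correctly caught
print(f"{false_positives:,} legitimate items blocked; {true_positives:,} caught")
# 149,700 legitimate items blocked; 19,700 caught
```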
In that totally unrealistic scenario, with a lot more infringement than is actually happening and with accuracy rates way above reality, you still end up pulling down 150,000 non-infringing items… just to stop fewer than 20,000 infringing pieces of content.
Indeed, Muffett then figures out that with a 98.5% accuracy rate, a platform would need 1 in 67 of its items to be infringing before it “breaks even”: the amount of non-infringing content caught by the filter (roughly 147,000 items) equals the amount of infringing content it catches. But that still means censoring nearly 150,000 pieces of non-infringing content.
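That break-even point falls out of the same expected-value arithmetic: setting false positives equal to true positives, (1 − a)(N − B) = aB, simplifies to B = (1 − a)N. A quick sketch:

```python
total = 10_000_000
accuracy = 0.985

# Break-even: (1 - a) * (total - bad) == a * bad  =>  bad = (1 - a) * total
bad = round((1 - accuracy) * total)             # 150,000 infringing items
collateral = round((1 - accuracy) * (total - bad))  # legitimate items blocked

print(f"break-even at 1 in {round(total / bad)} items infringing")  # 1 in 67
print(f"{collateral:,} legitimate items blocked either way")        # 147,750
```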
This is one of the major problems that people don’t seem to comprehend when they talk about filtering (or even human moderating) content at scale. Even at impossibly high accuracy rates, a “small” percentage of false positives leads to a massive amount of non-infringing content being taken offline.
Perhaps some people feel that this is acceptable “collateral damage” to deal with the relatively small amount of infringement on various platforms, but to deny that it will create widespread censorship of legitimate and non-infringing content is to deny reality.