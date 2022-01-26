Google Drive's Autodetector For Copyright Infringement Is Locking Up Nearly Empty Files
We've talked at length about the issues surrounding automated copyright infringement "bots" and how often those bots get the primary question they're tasked with wrong. Examples of this are legion: Viacom's bot took down a Star Trek panel discussion, all kinds of bots disrupted the DNC's livestream of its convention, and one music distributor's bot fired off DMCA notices to, well, everyone. Google itself has reported that nearly 100% of the DMCA notices it gets are just bot-generated buckshot.
But Google isn't the savior here either. The company also uses automated systems for detecting copyright infringement and, at least in the case of Google Drive, those automated systems occasionally suck out loud at their job.
This week, Assistant Professor at Michigan State University, Dr. Emily Dolson, Ph.D. reported seeing some odd behavior when using Google Drive. One of the files in Dolson's Google Drive, 'output04.txt' was nearly empty—with nothing other than the digit '1' inside it.
But according to Google, this file violated the company's "Copyright Infringement policy" and was hence flagged. And what's worse is, the warning sent to the professor ended with "A review cannot be requeste for this restriction."
If your bot thinks a single digit is somehow copyright infringement, then your bot is a bad bot and should be taken behind the woodshed and humanely sent to bot-heaven where it can run and frolic with all the other bots. Now, to be fair, there is an open question in this case as to whether the chosen filepath names were somehow what got flagged. And, sure, maybe that happened. But it doesn't really change the point: a bot thought a file that contained a single integer was copyright infringement.
That being said, other Drive users have reproduced this, calling the filepath theory into question.
Dr. Chris Jefferson, Ph.D., an AI and mathematics researcher at the University of St Andrews, was also able to reproduce the issue when uploading multiple computer-generated files to Drive. Jefferson generated over 2,000 files, each containing just a number between -1000 and 1000.
The files containing the digits 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 were shortly flagged by Google Drive for copyright infringement.
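For anyone curious what that kind of test might look like, here is a minimal sketch in Python. It is not Jefferson's script, which isn't described beyond the quote above: the function name, the `number_<n>.txt` naming scheme, and the output directory are all assumptions made for illustration. It only writes the files locally; uploading them to Drive and watching for flags would still be a manual step.

```python
from pathlib import Path


def write_number_files(out_dir, low=-1000, high=1000):
    """Write one text file per integer in [low, high], each holding only that number."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(low, high + 1):
        # Hypothetical naming scheme; the file names used in the real test are not given.
        path = out / f"number_{n}.txt"
        path.write_text(str(n))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # "drive_test_files" is an assumed directory name.
    files = write_number_files("drive_test_files")
    print(len(files), "files written")
```

Covering every integer from -1000 to 1000 inclusive yields 2,001 files, which lines up with the "over 2,000 files" figure.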
Again, this sucks. For what it's worth, Google has finally responded and, despite the notices indicating there was no way to dispute the bot's findings, has been sharing out links to do exactly that. But that isn't really the point. This is base-level stuff here: having a system that operates this poorly means you have a system that never should have been in production to begin with. Particularly, frankly, when that system is operating as personal file storage for many, many people.
And yet, if they axe this system, you better believe copyright maximalists will cry foul and say Google is trying to let piracy run rampant. It’s a can’t-win situation for Google—one that, due to its own general disinterest in properly standing up for users against false copyright/DMCA claims, it has only made worse for itself over the years.
Re:
Google has stood up for its users against false DMCA claims before. Here's proof.
It's unclear as to whether this anecdote is a diamond in the rough or a drop in the ocean, though.
This has been said repeatedly: there is no satisfying the copyright maximalists, so don't even try. It only encourages them. Every inch you give them is another mile they will be asking for next.
That particular filter only applies if the person tries to share a file, and only stops them sharing it. Sucks though if you are trying to hand in coursework.
This is bloody high art, and I'm going to get my output04.txt t-shirt from the lobby straight away.
Great, someone copyrighted the number 1.
(Along with 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 apparently.)
Next, I imagine math students during tests...
"What is cos(0)?"
"I can't answer this question for fear of violating someone's copyright."
Seriously, bots can be useful, but - at least until they are somewhat reliable and able to understand context - they should never be used as anything more than an alerting tool. Currently, they are definitely not good enough for automated take-downs.
Obviously, if they can mess up such obvious cases, you can only imagine that they will also strike less obvious but still perfectly legal files.
This will never happen. Bots can’t understand context because contexts can change based on a number of variables. They’re good for broad-based “sledgehammer” moderation efforts, but they’ll never be able to handle the kind of narrower “scalpel” moderation that requires looking at context.
Re:
stares in 3 minutes of silence & white noise
I would be very much interested to know who claims to hold the copyright on 266, 500, and 1.
Re:
Don't expect the copyright office to help you with that. Hell, they're lucky if they can find their own asses with both hands, a flashlight, and a multi-color map.
But my guess is that no one holds any such copyrights, and that, in fact, Google's own bots were trained with numerous test case scenarios that were never removed from the "live" database.
Re: copyright on 266, 500, and 1.
Well, 266 is a Pantone colour that is remarkably close to a certain litigious chocolate manufacturer's "holy IP".
Pantone 500 could be the "salmon" of doubt.
Pantone 1 is a "shade of grey".
And selectively, 173 is 160 + 13, or AD in hexadecimal, which when converted to ASCII is LF and CR - or what you get when you press the ENTER key. 186 is an Intel processor suffix; 285 is an Intel processor with one of the pins missing; 302 is the cubic inch size of some V8 engines; 451 is associated with burning paper; and 833 is LEET for bee.
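The 173 arithmetic does check out, with one caveat: the LF/CR reading only works if each hex digit is treated as its own control code (0x0A and 0x0D), since 0xAD as a single byte is outside the ASCII range. A quick sketch in Python, where the function name and lookup table are made up for illustration:

```python
# ASCII control codes relevant here: 10 (0x0A) is line feed, 13 (0x0D) is carriage return.
CONTROL_NAMES = {10: "LF", 13: "CR"}


def hex_digit_controls(value):
    """Split a byte value into its two hex digits and name each as a control code."""
    high, low = divmod(value, 16)
    return format(value, "X"), CONTROL_NAMES.get(high), CONTROL_NAMES.get(low)


print(hex_digit_controls(160 + 13))  # ('AD', 'LF', 'CR')
```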
Some musings
You can't copyright a number. Can. Not. Do. (But...)
Sometimes, you don't have to. Remember DeCSS? It wasn't a copyright issue (which might, for instance, have protection through 17 U.S.C. § 512 or Section 230). No, it was a DMCA charge.
Of course, sometimes the Streisand Effect comes into play and you get a t-shirt or a song.
As the signature for one of my email accounts says: "Artificial intelligence can never overcome natural stupidity."
slaps the editor, hands out a d
"A review cannot be requeste for this restriction."
This is the most wrong portion of this.
I mean, random numbers triggering the bot is bad, but the fact there is no redress or recourse to challenge it is horrifying.
Broken tools & broken systems aside, removing the ability to challenge the "findings" when you spot what you think is an error is wrong on so many levels.
But then the entire system is lopsided in believing anyone who can claim they hold a copyright would never ever fib (despite the huge pile of cases where scammers are making bank) & that it is too onerous or impossible to challenge the claims when even a child could see it's not infringing.
Copyrighting numbers
You can't copyright a number? What is a CD, but a large number inscribed on a plastic disk?
And what about Cage's 4'33"? Is that copyrightable?
I remember a mathematician arguing that all numbers are interesting. I wonder if that applies to copyrightability?
