Google Drive's Autodetector For Copyright Infringement Is Locking Up Nearly Empty Files
from the whoopsie dept
We’ve talked at length about the issues surrounding automated copyright infringement “bots” and how often those bots get the primary question they’re tasked with wrong. Examples of this are legion: Viacom’s bot took down a Star Trek panel discussion, all kinds of bots disrupted the DNC’s livestream of its convention, and one music distributor’s bot fired off DMCA notices to, well, everyone. Google itself has reported that nearly 100% of the DMCA notices it gets are just bot-generated buckshot.
But Google isn’t the savior here either. The company also uses automated systems for detecting copyright infringement and, at least in the case of Google Drive, those automated systems occasionally suck out loud at their job.
This week, Dr. Emily Dolson, an assistant professor at Michigan State University, reported seeing some odd behavior when using Google Drive. One of the files in Dolson’s Google Drive, ‘output04.txt’, was nearly empty, with nothing other than the digit ‘1’ inside it.
But according to Google, this file violated the company’s “Copyright Infringement policy” and was hence flagged. What’s worse, the warning sent to the professor ended with “A review cannot be requested for this restriction.”
If your bot thinks a single digit is somehow copyright infringement, then your bot is a bad bot and should be taken behind the woodshed and humanely sent to bot-heaven where it can run and frolic with all the other bots. Now, to be fair, there is an open question in this case as to whether the chosen file paths were somehow what got flagged. And, sure, maybe that happened. But it doesn’t really change the point: a bot thought a file that contained a single integer was copyright infringement.
That being said, other Drive users have reproduced this, calling into question the filepath theory.
Dr. Chris Jefferson, an AI and mathematics researcher at the University of St Andrews, was also able to reproduce the issue when uploading multiple computer-generated files to Drive. Jefferson generated over 2,000 files, each containing just a number between -1000 and 1000.
The files containing the numbers 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 were flagged by Google Drive for copyright infringement shortly afterward.
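For anyone curious what a test set like that looks like, here is a minimal Python sketch of the file-generation step only. To be clear, this is not Jefferson's actual script: the function name, the `number_files` directory, the `number_<n>.txt` naming scheme, and the choice to write no trailing newline are all assumptions made for illustration. It does not touch Google Drive at all; uploading the results is left to the reader.

```python
from pathlib import Path


def generate_number_files(out_dir, low=-1000, high=1000):
    """Write one small text file per integer in [low, high].

    Hypothetical helper: the naming scheme and the absence of a
    trailing newline are assumptions, not details from the report.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(low, high + 1):
        path = out / f"number_{n}.txt"
        # Each file holds nothing but the number itself.
        path.write_text(str(n))
        paths.append(path)
    return paths


if __name__ == "__main__":
    files = generate_number_files("number_files")
    print(f"wrote {len(files)} files")
```

With the default inclusive range this produces 2,001 files, which lines up with the "over 2,000" figure above.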
Again, this sucks. For what it’s worth, Google has finally responded and, despite the notices indicating there was no way to dispute the bot’s findings, has been sharing out links to do exactly that. But that isn’t really the point. This is base-level stuff here: having a system that operates this poorly means you have a system that never should have been in production to begin with. Particularly, frankly, when that system is operating as personal file storage for many, many people.