Google Drive's Autodetector For Copyright Infringement Is Locking Up Nearly Empty Files

from the whoopsie dept

We’ve talked at length about the issues surrounding automated copyright infringement “bots” and how often those bots get the primary question they’re tasked with wrong. Examples of this are legion: Viacom’s bot took down a Star Trek panel discussion, all kinds of bots disrupted the DNC’s livestream of its convention, and one music distributor’s bot fired off DMCA notices to, well, everyone. Google itself has reported that nearly 100% of the DMCA notices it gets are just bot-generated buckshot.

But Google isn’t the savior here either. The company also uses automated systems for detecting copyright infringement and, at least in the case of Google Drive, those automated systems occasionally suck out loud at their job.

This week, Dr. Emily Dolson, Ph.D., an assistant professor at Michigan State University, reported seeing some odd behavior when using Google Drive. One of the files in Dolson’s Google Drive, ‘output04.txt’, was nearly empty, with nothing other than the digit ‘1’ inside it.

But according to Google, this file violated the company’s “Copyright Infringement policy” and was hence flagged. And what’s worse is, the warning sent to the professor ended with “A review cannot be requeste for this restriction.”

If your bot thinks a single digit is somehow copyright infringement, then your bot is a bad bot and should be taken behind the woodshed and humanely sent to bot-heaven where it can run and frolic with all the other bots. Now, to be fair, there is an open question in this case as to whether the filepath names that were chosen somehow were what was getting flagged. And, sure, maybe that happened. But it doesn’t really change the point: a bot thought a file that contained a single integer was copyright infringement.

That being said, other Drive users have reproduced this, calling into question the filepath theory.

Dr. Chris Jefferson, Ph.D., an AI and mathematics researcher at the University of St Andrews, was also able to reproduce the issue when uploading multiple computer-generated files to Drive. Jefferson generated over 2,000 files, each containing just a number between -1000 and 1000.

The files containing the digits 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 were shortly flagged by Google Drive for copyright infringement.
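For anyone curious what a test batch like Jefferson's might look like, here is a minimal sketch in Python. The quoted report says only that there were over 2,000 files, each containing a number between -1000 and 1000; the function name, the file naming scheme, the trailing newline, and the output directory are all assumptions made for illustration, and the upload to Drive is not shown.

```python
from pathlib import Path
import tempfile


def generate_number_files(out_dir, low=-1000, high=1000):
    """Write one small text file per integer in [low, high] and return the paths.

    Hypothetical helper: the naming scheme and the trailing newline are
    assumptions, not details taken from the original experiment.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(low, high + 1):
        path = out / f"{n}.txt"      # assumed naming scheme, e.g. "173.txt"
        path.write_text(f"{n}\n")    # the file holds nothing but the number
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Write into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        files = generate_number_files(tmp)
        print(len(files), "files written")
```

With the default range this writes 2,001 files, which fits "over 2,000". Whether any given file gets flagged after upload is entirely up to Google's system.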

Again, this sucks. For what it’s worth, Google has finally responded and, despite the notices indicating there was no way to dispute the bot’s findings, has been sharing out links to do exactly that. But that isn’t really the point. This is base-level stuff here: having a system that operates this poorly means you have a system that never should have been in production to begin with. Particularly, frankly, when that system is operating as personal file storage for many, many people.

Companies: google


Comments on “Google Drive's Autodetector For Copyright Infringement Is Locking Up Nearly Empty Files”

Stephen T. Stone (profile) says:

having a system that operates this poorly means you have a system that never should have been in production to begin with

And yet, if they axe this system, you better believe copyright maximalists will cry foul and say Google is trying to let piracy run rampant. It’s a can’t-win situation for Google⁠—one that, due to its own general disinterest in properly standing up for users against false copyright/DMCA claims, it has only made worse for itself over the years.

Rekrul says:

Re: Re:

It’s a can’t-win situation for Google⁠—one that, due to its own general disinterest in properly standing up for users against false copyright/DMCA claims, it has only made worse for itself over the years.

Correction: Its own general disinterest in paying humans to look into things rather than simply relying on AI to do everything.

Now, before anyone responds and tells me that it’s impossible for humans to check everything, I’m not suggesting that. What I am suggesting is that Google should be willing to pay a staff to look into situations like these when the AI screws up. Instead, they’ve automated the process and getting a human involved requires you to be able to focus some negative media attention on the company.

I can fully understand using AI to help police its sites/services, but if you’re going to make products that are used by the general public, you also need to invest in actual staff who are going to make sure that said AI is working properly and not screwing over your users.

How long would your local supermarket last if it was all automated, regularly screwed up, and all your disputes were rejected by a computer?

This comment has been deemed funny by the community.
Wyrm (profile) says:

Great, someone copyrighted the number 1.
(Along with 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 apparently.)
Next, I imagine math students during tests…
"What is cos(0)?"
"I can’t answer this question for fear of violating someone’s copyright."

Seriously, bots can be useful, but – at least until they are somewhat reliable and able to understand context – they should never be used as anything more than an alerting tool. Currently, they are definitely not good enough for automated take-downs.
Obviously, if they can mess up such obvious cases, you can only imagine that they will also strike less obvious but still perfectly legal files.

sumgai (profile) says:

Re: Re:

Don’t expect the copyright office to help you with that. Hell, they’re lucky if they can find their own asses with both hands, a flashlight, and a multi-color map.

But my guess is that no one holds any such copyrights. That in fact, Google’s own bots were trained with numerous test case scenarios, and those were never removed from the "live" database.

Bobvious says:

Re: copyright on 266, 500, and 1.

Well, 266 is a pantone colour that is remarkably close to a certain litigious chocolate manufacturer’s "holy IP".

Pantone 500 could be the "salmon" of doubt.

Pantone 1 is a "shade of grey"

And selectively, 173 is 160 + 13, or AD in hexadecimal, which when converted to ASCII is LF and CR – or what you get when you press the ENTER key. 186 is an Intel processor suffix; 285 is an Intel processor with one of the pins missing; 302 is the cubic inch size of some V8 engines; 451 is associated with burning paper; and 833 is LEET for bee.

Anonymous Coward says:

Some musings

You can’t copyright a number. Can. Not. Do. (But…)

Sometimes, you don’t have to. Remember DeCSS? It wasn’t a copyright issue (which might, for instance, have protection through 17 USC 512 or Section 230). No, it was a DMCA charge.

Of course, sometimes the Streisand Effect comes into play and you get a t-shirt or a song.

John85851 (profile) says:

Re: Some musings

You may not be able to copyright a number, but you can scare people into thinking you can.
Years ago, I used to sell digital models on TurboSquid. One of my models had a description that said "includes a table model with 1,747 polygons". Their system flagged it and said I couldn’t use "747" because it was a Boeing copyright. Yes, a human would see that I’m selling a furniture model, not an airplane, but their automated system was set up to flag and avoid possible complaints from companies like Boeing.
This rule also applied to numbers like 350 (BMW), 356 (Porsche), 250 (Ferrari), and so on.

I once uploaded an F-14 aircraft model and my description included a little history about how the aircraft served on the USS Enterprise aircraft carrier. Their system flagged it and said "Enterprise" was copyrighted by CBS/Viacom.
(Yet there are plenty of Star Trek models for sale at TurboSquid, so it’s not like this flagging is stopping anyone from selling Star Trek models.)

Can someone copyright the word "Enterprise" in every single usage? Probably not, but if a company can scare people into believing they own it, then that’s good enough.

That Anonymous Coward (profile) says:

slaps the editor, hands out a d
"A review cannot be requeste for this restriction."

This is the most wrong portion of this.
I mean random numbers triggering the bot is bad, but the fact there is no redress or recourse to challenge it horrifying.

Broken tools & broken systems aside, removing the ability to challenge the "findings" when you spot what you think is an error is wrong on so many levels.

But then the entire system is lopsided in believing anyone who can claim they hold a copyright would never ever fib (despite the huge pile of cases where scammers are making bank) & that it is too onerous or impossible to challenge the claims when even a child could see it’s not infringing.

Rekrul says:

Re: Re:

This is the most wrong portion of this.
I mean random numbers triggering the bot is bad, but the fact there is no redress or recourse to challenge it horrifying.

I’ve told this story before, but a couple years ago, I got an email telling me that I’d been banned from posting comments on YouTube. It claimed that I had violated the community standards against spam/advertising. I hadn’t posted anything that could be considered either. Strangely, my channel page showed that I had no strikes for anything.

I disputed the ban and a day later received an email saying that they had looked into it and decided that the ban was appropriate. There was no mention of what I supposedly posted and no further options.

I posted on the help forum, someone said that they would mention it to a moderator, but made no promises. About a week later, I got an email that after further review, the ban had been lifted. No explanation of what triggered it in the first place, no explicit admission that they screwed up.

All I can think of is that the night before, I discovered a new channel, watched several of the videos and commented on them. All were unique, on-topic, completely non-controversial comments. Still, maybe their AI is so dumb that it considers too many comments in too short a time to be spam, without even taking the contents into consideration. And by that, I mean checking to see if I had posted the same comment on multiple videos.

I am convinced that my dispute of the ban was simply rejected by the AI without any actual review. I think they only offer a dispute option so that they can claim that users can dispute problems. I don’t think filing such disputes will ever actually do anything.

Anonymous Coward says:

Re: Copyrighting numbers

I remember a mathematician arguing that all numbers are interesting. I wonder if that applies to copyrightability?

I don’t know. It just doesn’t feel right. I can’t come up with a general argument so I’ll resort to a thought experiment.

Let’s assume that numbers are copyrightable. Some of them belong to the public domain because the ancient Sumerians published them. I’m also sure that there are a ton of orphan works, numbers for which no one knows the author. Okay. I don’t see anything particularly strange yet. Do you?
Now suppose there’s a copyright registration system. Assume that people somehow have digital computers identical to the one I used to post this. Let’s also assume that people know how to write software and that the coders use C. How do you register a number? Do you have to literally write it out in base 10? Can you write it in English words? What happens if someone else tries to register the number in Greek? What happens if someone registers something that is at its core a number, but the person has no idea what that number actually is? For example, what happens if I try to register an image file? Do I also get the copyright for the binary representation of that file? What if I have no idea what the binary representation is? What if I have no idea what the base ten representation is? Why bother copyrighting numbers ever again if I can instead just copyright the images?

I got nothing useful out of that.

A chemist argues that all chemicals are interesting.
A chef argues that all foods are interesting.
A physicist argues that all reference frames are interesting.
A theologian argues that all entities in religious texts are interesting.
A musician argues that all sound waves are interesting.

I got nothing useful out of that too. Maybe you got something useful out of it though.

Rekrul says:

Now, to be fair, there is an open question in this case as to whether the filepath names that were chosen somehow were what was getting flagged.

Even if the path names contained the names of copyrighted works, path names themselves don’t qualify as copyright infringement. As such, any bot SHOULD have been designed to only flag and analyze FILES with infringing sounding names. And even then, if the contents don’t match, there’s no infringement.

I’m not a programmer, although I do write Windows Batch scripts for my own personal use and whenever I write one to handle any task that might involve unknown conditions, I try to think of everything that could possibly go wrong and account for it. I don’t always succeed, but I like to think I at least cover the most obvious possible problems.

DB says:

TD kind of missed the point on this article. Why can’t you have a document with copyrighted material that you don’t own? It might make sense if you share that file publicly, but even then it’s just dumb. Drive is a file utility and there are a million reasons why you might have such a file. Like, say, a file of song lyrics that you’re going to sing with some friends… a file of inspiration images for some project… a pdf of a book you bought…

Anonymous Coward says:

[Off topic]

I’m starting to suspect that the average person thinks copyright is "I made this. Don’t copy me." They’re wrong. Copyright is about more than copying. It’s about control, politics, money, and the power of pathos. I’m starting to think that high schools / secondary schools should teach all students about the history of copyright. All of it. No music-industry-funded cherry-picking. Starting from the printing press censorship drama all the way until Winnie the Pooh’s ascendance. I suspect that it won’t go well. At some point some student is bound to wonder why copyright lasts so long and why copyright proponents have no idea what "accountability" means.
Ethics aside, sometimes I really wish I could see an accurate simulation of humans. I want to see how a society with our technology behaves when every person has no idea what copyright is and has never been taught about anything remotely related to the concept of copyright.

Anonymous Coward says:

Re: [Off topic]

Interesting you should say that. I have come to believe, in part because of evidence (note I said "evidence", not "proof"), that reincarnation is actually a thing. But one of the open questions among those who believe in the possibility of reincarnation is whether we are able to pick where we reincarnate, or even if we have the option to not reincarnate at all. One thing I have thought about is that if we do have a choice, I would not want to be reborn in the United States again, but in particular I kind of hope I never have to live another life on planet Earth again. Earth is a lovely planet and there are some wonderful things here, but there are also some very terrible things here, and some of the worst kinds of people, but to my mind one of the absolute worst things about planet Earth is that the ridiculous concept of "intellectual property" (in all its various forms) ever became a thing. It is like so many other things on this planet, where someone possibly had good intentions (or at least not purely evil intentions) but had NO idea of the monster they were creating (though they might have if they had really thought it through).

When people come up with something they think is a good idea, they ought to ask themselves, how could evil people misuse this idea? Because it seems that is invariably what happens sooner or later. Many of the ideas put into the U.S. Constitution and Bill of Rights were good ideas at the time, until people started misusing them to gain power and/or make a profit, or just to make life miserable for other people.

So if I absolutely have to be reborn somewhere, I really hope I can pick a planet where the concept of intellectual property has either never crossed the mind of anyone, or if it did, it was promptly shut down and ridiculed as the evil thing it is. That’s not the only thing I’d like to see, of course (no religions that exist primarily to make people subservient to other people would be a big one) but it’s very high on my list of things I’d like to never see again.

Anonymous Coward says:

worth vs wealth

Google, Facebook, Amazon, Microsoft, etc. are all making money off of our data. Can I have some of that money, since it is my data? Why can’t we all come up? Google is making more money than they know what to do with (e.g. their multitude of failed/abandoned products and projects); Amazon’s going to the moon (eventually); on and on…
Since my data is worth soooo much, share the wealth.

