Anonymous Coward

October 25, 2022 at 4:26 pm

Not a stackoverflow searcher

GitHub CoPilot definitely has the capability to generate novel solutions. While there are reports that it does sometimes regurgitate verbatim code, I am not sure how frequent that problem is. Most notably, Microsoft added an option to scan the suggestions against the training set to prevent the AI from spitting out verbatim matches.
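To make the idea of scanning suggestions against a training set concrete, here is a minimal sketch. It assumes a hypothetical in-memory corpus and a simple token-window comparison; the function names and the window size of 8 tokens are illustrative choices, and nothing here reflects how Copilot's actual filter is implemented.

```python
import re

def tokens(code: str) -> list:
    """Split code into identifiers, numbers, and punctuation, ignoring whitespace."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def shingles(code: str, n: int) -> set:
    """Return every run of n consecutive tokens found in the code."""
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(corpus: list, n: int = 8) -> set:
    """Collect the token windows of every document in a (hypothetical) training corpus."""
    index = set()
    for doc in corpus:
        index |= shingles(doc, n)
    return index

def has_verbatim_overlap(suggestion: str, index: set, n: int = 8) -> bool:
    """True if any n-token window of the suggestion also appears in the corpus."""
    return not shingles(suggestion, n).isdisjoint(index)

# Usage: a one-file "training set" and two candidate suggestions.
corpus = ["def add(a, b):\n    return a + b\n"]
index = build_index(corpus)
same_code = has_verbatim_overlap("def add(a,b): return a+b", index)
new_code = has_verbatim_overlap("def mul(x, y):\n    return x * y", index)
```

Comparing tokens rather than raw text means reformatting alone does not hide a match, while renaming every variable would. A real system would need a far more compact index than a Python set.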

I would say that it is not terribly concerning at the moment. It’s fine to quickly google the code it produces to check and just do a bit of minimal due diligence.

Ehud Gavron (profile)

October 25, 2022 at 5:31 pm

The "problem" of copyrights has a workaround.

Copyright issues are indeed aggravating, but there are developed mechanisms dealing with issues created by copyright. Fair-use in the US is one. Licenses that impose restrictions exist BECAUSE of copyright. No copyright, no “right to release under a license.”

The terms of the license, and I’ll speak to the GPL here, are that it allows use of the content, copying, including, re-using, etc. AS LONG AS the person doing so complies with the terms of the license.

When Copilot provides a verbatim copy of a project “piece” (be it a file, a commit, a branch, trunk, archive, etc.) and does so in violation of the license and is not fair-use, that is a problem.

Analogy: When I sign a non-disclosure agreement (NDA) the parties generally agree that non-public information disclosed pursuant to that agreement will not be disclosed, unless it is publicly disclosed –through no action of their own– and otherwise available.

Copilot effectively renders the “secret” material public, license-free, obligation-free. I can’t copy module matrix_rewrite_X from the repository without being subject to its license terms. These may be incompatible with other licenses in the same project.

If I word my query right and Copilot gives me that same module but absent any license restrictions, I’m not only free to use it, but free to provide it to others on my new website “UnlicensedPreviouslyLicensed.com” (UPL is also unlicensed practice of law, which –as I am not a lawyer– I’m not doing here.)

It’s great to blame MS or GH much like people blame Mr. Musk for something Tesla did or blame Mr. Bezos for screwing over ULA by having Blue Penis Origin focus on getting him into “space” before Mr. Branson instead of developing the BE-4 engine, but that’s not who or where the problem starts or ends, and blaming them is just misguided fingerpointing.

If MS-owned-GH wanted to take a proactive approach, they could request all [nonorphaned] projects to include a “COPYRIGHT.txt” and/or “LICENSE.txt” and have Copilot include that WHEN THE RESULT IS AN ENTIRE SECTION of the repository.
I don’t think that’s feasible for a variety of reasons, mostly because a) orphan works, b) why should X devs with Y projects and Z files do X*Z edits so that Copilot can avoid violating their rights, and c) the burden should be on Copilot.

Or, my favorite:
Copilot displays a disclaimer, links to the ORIGINAL GH files (as I believe it already does) and advises the recipient of the search result to comply with all agreements, regulations, licenses, etc. Is this stupid? Maybe, but so is the value of [I agree] in a court of law.

E

Anonymous Coward

October 25, 2022 at 7:36 pm

Re: making an analogy with NDAs

Your analogy comes off as incomplete. I can see why Christenson was confused. You need to include an analogue to the output of a machine learning program on open source or libre software.

Here’s what I have in mind: Imagine that a very good security breaker has taken NDAed information from many companies and has trained a machine learning program on that information. Does the novel nature of the output of the ML program mean that the NDA obligations which apply to the input don’t apply to the output? With respect to the legality, I have no idea. With respect to the social acceptability, I would say that stuffing NDAed information into an ML program should not be permissible, unless the maker of the program asks each company for permission to release the output. Or alternatively, society could do away with the concept of NDAs. (I’m NOT saying that that’s a good idea!)

Similarly, I would argue that stuffing code licensed under GPLs (there are multiple versions) into an ML program doesn’t free Github from the license obligations, unless the maker explicitly releases all of the output under the latest version of the GPL. Or alternatively, society could get rid of the concept of copyright for software. (Again, whether that’s a good idea is a different question.) If Github can’t comply with this, then Copilot should not be allowed to exist. (Legally, Github can’t comply with the GPL’s obligations, because some of the code in the training set is proprietary despite being source-available.)

nasch (profile)

October 26, 2022 at 10:21 am

Re: Re:

> Similarly, I would argue that stuffing code licensed under GPLs (there are multiple versions) into an ML program doesn’t free Github from the license obligations, unless the maker explicitly releases all of the output under the latest version of the GPL.

If I stuff my brain with a bunch of GPL code, do I then need to open source code that I write that was enabled by what I learned from that GPL code? Obviously not. Is the way a machine learning program operates different from a human in a legally significant way? If so, how?

Steven (profile)

October 26, 2022 at 10:46 am

Re: Re: Re: Not quite so clear cut

That’s not as clear cut as you make it sound. There are large projects that require ‘clean room’ standards for their contributors for that exact reason.

As an example, in the WINE project you are not allowed to contribute if you have ever seen windows source code. From their developer FAQ: “This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise).”

Clearly this is a corner case, and they are being overly cautious because of who they are dealing with, but this can be a legal grey area depending on the specifics.

Anonymous Coward

October 26, 2022 at 9:25 pm

Re: Re: Re:

You raise a good point, but it’s not as obvious as you say. Replace “GPL code” with “information under NDA”, “trade secrets”, or “information about patented inventions”. If no one finds out, then you can’t get caught. But every once in a while the output will be so similar to an input that someone will find out. Who should be held accountable? Or should no one be punished at all? And suppose that Copilot were trained only on GPLed code. Legally, the output may be free from the GPL’s obligations. But in principle, wouldn’t you agree that this hypothetical version of Copilot breaks the social contract of the GPL, given that copylefted code is in a sense more free than public domain code? And it would be simple for Microsoft to exclude almost all GPLed code on Github by searching for licenses and the boilerplate that almost all GPLed code files include. A best effort shouldn’t be punished, after all. But Microsoft won’t even attempt to exclude GPLed code, will they?

Legally, Microsoft may turn out to be aboveboard. But there is also the issue of power. My concern is that Microsoft is deliberately breaking social contracts associated with copyleft licenses. Or more conservatively, Microsoft is being a jerk on purpose because Microsoft believes that authors of GPLed code can’t fight back in court. Microsoft would never put their own code into an ML program like Copilot, would they? It shouldn’t be a problem for Microsoft’s profits if the output isn’t similar to the input, right? There are no hard answers, but it’s not a stretch to conclude that Microsoft is acting in bad faith in ignoring copyleft license obligations.

TLDR: Legally, you may be correct. But socially, the answer isn’t as obvious as you make out. Microsoft might be spiting license obligations (especially copyleft) on purpose. Microsoft certainly wouldn’t include their own code in Copilot, and they won’t voluntarily make any attempt to exclude GPLed code from the training set.
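The boilerplate search suggested above could look roughly like the sketch below. It is a heuristic under stated assumptions: it only checks the first lines of a file for an SPDX tag or the standard GNU notice wording, the function name and the 40-line cutoff are made up for illustration, and it treats LGPL and AGPL as part of the same family.

```python
import re

# SPDX tag such as "SPDX-License-Identifier: GPL-2.0-only" (also LGPL-/AGPL- ids).
SPDX_TAG = re.compile(r"SPDX-License-Identifier:[^\n]*\b[AL]?GPL-\d", re.IGNORECASE)

# Wording used by the standard GNU license notices.
NOTICE = re.compile(
    r"GNU\s+(?:Lesser\s+|Affero\s+)?General\s+Public\s+License", re.IGNORECASE
)

def looks_gpl_licensed(text: str, header_lines: int = 40) -> bool:
    """Heuristically decide whether a source file carries GPL-family boilerplate."""
    head = "\n".join(text.splitlines()[:header_lines])
    if SPDX_TAG.search(head):
        return True
    # Collapse whitespace and common comment characters so that a notice
    # wrapped across comment lines still matches.
    flattened = re.sub(r"[\s*#/;]+", " ", head)
    return bool(NOTICE.search(flattened))

# Usage on two tiny, made-up file headers.
gpl_file = "/*\n * under the terms of the GNU General\n * Public License as published by\n */"
mit_file = "// SPDX-License-Identifier: MIT\nint x;"
flags = [looks_gpl_licensed(gpl_file), looks_gpl_licensed(mit_file)]
```

A check like this would miss files that rely only on a repository-level LICENSE file, so it would be a best effort rather than a guarantee.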

stine

October 26, 2022 at 4:06 am

Re: the problem is you've ignored the fact that this is a computer

The problem is you’ve ignored the fact that this is a computer. That computer should have been programmed with the details of every available license for GH code elements. That program could then search for elegant solutions in ?all? code, but only provide matching code from repositories with appropriate licenses. The fact that it does not do this indicates to me that the authors of this tool did not consider the code’s license important.
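A minimal sketch of that kind of license-aware filtering follows. It assumes a retrieval-style system in which every candidate snippet has a known source repository and license, which may not match how Copilot actually produces suggestions; the `Candidate` type, the repository names, and the allow-list are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Candidate:
    """A hypothetical suggestion together with where it came from."""
    snippet: str
    repo: str
    spdx_license: Optional[str]  # None when the repository declares no license

def filter_by_license(candidates, allowed):
    """Keep only candidates whose source license is on the allow-list.

    Candidates with an unknown license are dropped, on the assumption that
    no declared license means no permission.
    """
    return [c for c in candidates if c.spdx_license in allowed]

# Usage with made-up repositories.
candidates = [
    Candidate("def a(): ...", "example/permissive", "MIT"),
    Candidate("def b(): ...", "example/copyleft", "GPL-3.0-only"),
    Candidate("def c(): ...", "example/unlabelled", None),
]
kept = filter_by_license(candidates, allowed={"MIT", "Apache-2.0"})
kept_repos = [c.repo for c in kept]
```

Which licenses belong on the allow-list depends on the license of the project receiving the suggestion, so in practice the set would have to be chosen per user rather than fixed.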

nasch (profile)

October 26, 2022 at 10:25 am

Re: Re:

> That program could then search for elegant solutions in ?all? code, but only provide matching code from repositories with appropriate licenses.

This seems to assume that Copilot is just spitting out verbatim code snippets in response to a search query. I have not used it, but my understanding is it suggests code tailored to what the user is doing, which is informed by its training set but not necessarily identical to anything found therein. So the question isn’t as simple as whether it’s allowed to copy a particular piece of code.

Anonymous Coward

October 25, 2022 at 6:14 pm

How and Why are different questions

This post’s comment re context is everything for me. It is entirely possible that Copilot would give me a solution I could just plunk down into my code without problem or thought.

… but if I don’t know why the code works, it is useless to me, because I’d then have to go back to Copilot the next time as well. Stack Overflow most often includes the “why” to what I want to know.

I could, though, mention some of the horrible, horrible Medium posts I’ve waded through only to find that it didn’t come close to answering my question.

Christenson

October 25, 2022 at 6:47 pm

Clarifying the source licenses..

Ehud:
It does not appear that copilot is using any code that you could not hunt down in the public part of github, so none of it is “secret”.

In addition to “no warranty”, licenses say different things.
GPL says “You must provide access to the source code” and “require users to do the same”.
Others require you to acknowledge your contributors.

Both of those are scholarly courtesies that Copilot could easily reinforce, but go against the grain of commercial product development — think Windows or your favorite Internet of Garbage toy.

To the extent copilot does not identify GPL material and its source, I think copilot violates the GPL. I’m not saying that Mike looking up the 10 lines of how to do something isn’t fair use; it’s the repetition of that act a million times by copilot that changes the nature of the copying.

Christenson

October 26, 2022 at 1:53 am

Re: An interesting license for nmap...

As part of a study project, I encountered the license for Npcap (from the Nmap Project), which specifically limits the use of its otherwise open code as follows:

Even though Npcap source code is publicly available for review, it is
not open source software and may not be redistributed or used in other
software without special permission from the Nmap Project. The
standard (free) version is usually limited to installation on five
systems. We fund the Npcap project by selling two types of commercial
licenses to a special Npcap OEM edition:

That’s a use case Microsoft should have to agree to, or keep the code off of github…

Christenson

October 26, 2022 at 7:36 am

Re: Re: More license complication -- anti-patent + GPL

I seem to be getting schooled this week on licenses: nmap (www.nmap.org) network mapping tool is under GPL, but revokes that license if you sue and allege nmap infringes any patent.

Christenson

October 28, 2022 at 12:56 am

Re: Re: Re: yet another license twist

RFC 8446 (TLS 1.3) at https://datatracker.ietf.org/doc/html/rfc8446 has an interesting twist on licensing: Schrödinger’s permission.

This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.

naoEntendo (profile)

October 25, 2022 at 6:55 pm

It's not the same, but it was bound to be tried eventually

While CoPilot is an interesting idea that was bound to be tried sooner or later, I believe it’s currently, especially in how it was/is trained, more like ClearView than StackOverflow.

ClearView scraped images from all over the web without the owners’ permission, or even awareness.

CoPilot scraped source code from GitHub without the owners’ permission, or even awareness.

Some will argue that buried in the tens of thousands of pages of legalese governing GitHub, legalese I might remind you Microsoft can alter at its unilateral whim anytime it wants to, gives Microsoft the right to use the source code as it sees fit. Others will argue that it’s fair use (isn’t that what ClearView is arguing, that the use of people’s images is fair use?). I think those arguments miss the point.

Microsoft should have been open and above board before they used other people’s code as grist for CoPilot’s mill. I believe that they weren’t because they feared that folks would start deleting their repositories en masse. What good is a source code repository without much source code? Just ask SourceForge.

Some folks may have pulled their repositories, other repositories are public domain, and still other authors might not have cared. Microsoft is counting on the fact that they are Microsoft with all of the clout and lawyers that entails to see them through.

Instead, the Software Freedom Conservancy is urging all open source projects to pull their source code off of GitHub. People are now wondering, “If Microsoft can do this with my source code, what’s next?”. And the knowing grey beards in the back are saying, “See I told you this would happen when Microsoft bought GitHub.”

I wouldn’t be surprised if the next round of opensource licenses contains a new clause strictly prohibiting use of the source code in any ML data sets.

The author opined that CoPilot is a lot like Google’s scanning project, now Google Books. CoPilot (C) is different than Google Books (G) in many important ways.

G allows analysis to be performed on book contents, such as how often “he” shows up, or how many works reference “boats”. When G lets you browse a particular book, you know the book’s name, author, etc. G limits the amount of access based on the rights to the book. Public domain books are free to read/download in their entirety. Other books are limited from between a few lines, to many/most pages, depending on the wishes of the copyright holder.

C on the other hand doesn’t let users perform analysis, although it does analyze software repositories. It doesn’t let the user know which work, author, or license the answer comes from. C doesn’t let you see the context, it just presents an answer, typically verbatim copying from some other author’s work. C doesn’t respect the wishes of the code’s author or license.

It’s a shame Microsoft decided to, well, be Microsoft. Hopefully the courts will decide fairly (yea, I know, there’s not the best track record there) and that when someone else goes down this path, they do it with a lot more consideration.

Anonymous Coward

October 25, 2022 at 9:12 pm

Re:

> Some will argue that buried in the tens of thousands of pages of legalese governing GitHub, legalese I might remind you Microsoft can alter at its unilateral whim anytime it wants to, gives Microsoft the right to use the source code as it sees fit.

What happens when somebody uploads a large codebase like Linux to GitHub, without the permission of every copyright holder? Are such uploads against the terms of service?

Ehud Gavron (profile)

October 25, 2022 at 10:17 pm

This isn't just limited to software...

Arstechnica reports that ShutterStock may have the same problem because they used their database of images to train an AI to provide… images.

| What happens when somebody uploads a large codebase like Linux to GitHub, without the permission of every copyright holder? Are such uploads against the terms of service?

Linux (or GNU/Linux if you prefer) has stayed on GPLv2 rather than moving to GPLv3, in part because relicensing would require getting EVERYONE who had contributed GPLv2-licensed code to agree to “GPLv3 [or greater]”. I don’t think “every copyright holder” in anything that large is likely to be
– reachable
– amenable
– able
– responsive within a set timeframe
for that to occur.

E

cls

October 27, 2022 at 1:39 am

Re: analogous to orphaned works

> I don’t think “every copyright holder” in anything that large is likely to be
– reachable
– amenable
– able
– responsive within a set timeframe
for that to occur.

Directly analogous to orphaned works in books, etc.

Could be solvable with shorter copyright duration and non automatic renewal.

TKnarr (profile)

October 25, 2022 at 10:26 pm

Stackoverflow vs. Copilot

I think the comparison to Stackoverflow answers misses an important point: when you take an SO answer you’re doing so with at least the implicit consent of the person who wrote the answer (they posted their answer in response to a question, they have to at least consider that the answer will be used by the questioner). That’s a reasonable basis for you to assume the code there can be used (unless the respondent said otherwise). Copilot OTOH spits out code with no involvement from the authors of the code it was trained on, without them knowing their code would be used to provide the answer, and without any concern on Copilot’s part about what conditions might be attached to the code. You have no idea if the authors of the code would’ve consented to provide it in answer to a question if asked. You don’t even know that it’s not a copy of proprietary code. Just being on the public portion of Github doesn’t imply anything, books after all are publicly readable by anybody who picks one up but that by itself doesn’t grant anyone any rights to take the text they read and use it elsewhere.

The problem isn’t copyright, or Github. The problem is that Copilot is presented as something it isn’t. It’s presented as “AI”, something capable of coming up with original work on its own, when what it is is simply a very complex program that takes a representation of data that it’s been given, transforms it in a deterministic fashion based on rules it has and produces the resulting transformed material as output. Its data set is large beyond comprehension, its rules are so complex no human can understand them, it can spot patterns in its data that no human would ever notice and add a rule describing them to its rule set, but it hasn’t yet passed the threshold where it can create original material (material that wasn’t provided to it as input or deterministically derived from its data according to rules in its rule set).

TKnarr (profile)

October 25, 2022 at 10:29 pm

Re:

Oh, and if you scoff at the idea of a program so complex its developers can’t understand what it’s doing, let me assure you every professional software engineer out there (and most of the amateurs) deals with exactly that on a daily basis (mostly by inventing new curses to heap on it on a regular basis).

nasch (profile)

October 26, 2022 at 11:03 am

Re:

> That’s a reasonable basis for you to assume the code there can be used (unless the respondent said otherwise).

There’s no need to make assumptions; the materials are provided under specific license terms. The exact implication of those terms for your specific use is a different matter.

https://stackoverflow.com/help/licensing

Steven (profile)

October 26, 2022 at 11:04 am

Not sure I buy the draining open source community argument.

I am a full time software developer. More than 15 years in the industry.

I have on rare occasion contributed to open source software. I would be one of those ‘low-touch interactions’. I have often wanted to get more involved with open source projects, but I haven’t largely because of the time it would take.

I can’t think of a single time I’ve been drawn to an open source project because I searched for help putting together some small number of lines of code. While I understand there are some libraries that are very small I generally wouldn’t use one that I could replace with a few lines of code, barring some very specialized task.

I can’t think of a single interaction with an open source project I would have, that copilot would get in the way of.

Almost all my use of open source code is going to be in the form of a library or tool.

If I’m searching for some small code solution I’m not looking for open source code. I’m likely going to find the solution on a blog or something like stackoverflow.

If I’m searching for functionality I think a library/tool would be the right thing for, I’m looking for a library/tool and wouldn’t be in a situation to use copilot anyway.

If I’m using a library/tool and have an issue I’m either going to google ‘issue + library name’ or go directly to the site for the library/tool.

I’d like to hear of a situation where copilot would get in somebody’s way of interacting with open source. I just don’t see it.

Crafty Coyote

October 26, 2022 at 11:56 am

CoPilot should have the disclaimer-“Legal to use at the time of release.”

Like most of these things related to copyright, make the most of it while it is still legal to do so

pegr

October 26, 2022 at 1:21 pm

The author’s legal analysis is weak

Developers miss a very important point about software. Not all components of software are copyrightable. If someone lifts your sort routine, they have not violated your copyright.

Copyright applies only to creative expressions. Purely functional expressions are not subject to copyright.

Now if a search engine can generate code to do something mundane, how creative were all the original expressions the search engine indexed? If the expression is commonly used in many applications, I’m betting it’s purely functional in nature.

The argument doesn’t even need to mention “Fair Use”.

Anonymous Coward

October 26, 2022 at 3:31 pm

Re:

> If someone lifts your sort routine, they have not violated your copyright.

This is overly simplistic, and only a court can really make that determination. Things like variable names and comments could be considered creative enough to be copyrightable.

> Copyright applies only to creative expressions. Purely functional expressions are not subject to copyright.

Prior to the late 1970s, all software was considered “purely functional” and was not subject to copyright. I still think that’s the most sensible view.

Kinetic Gothic

October 27, 2022 at 6:00 am

So I'm not alone..

I’m glad I’m not the only one who’s thought of the interaction between copyright and face recognition. When I first read about Clearview here, my first reaction was “they’re making a commercial product based on photos that I own the copyright to, they’re not much different from the people at Printerval who scraped my t-shirt graphics from Spreadshirt.” A C&D asking them to delete any works based on my feeds was certainly on my mind.

Anonymous Coward

October 28, 2022 at 1:10 am

Hypocrisy

I think some of the anger around Copilot comes from the fact that if an individual or small startup had tried to train a machine-learning model on public source code from GitHub and other public sources, they would quite possibly be sued by many copyright stakeholders (possibly including Microsoft), and it would potentially be a difficult hole to get out of.

If GitHub Copilot Is A Copyright Problem, Perhaps The Problem Is Copyright

from the ai-did-not-write-this-article dept
