If GitHub Copilot Is A Copyright Problem, Perhaps The Problem Is Copyright

from the ai-did-not-write-this-article dept

Last week a new GitHub Copilot investigation website created by Matthew Butterick brought the conversation about GitHub’s Copilot project back to the front of mind for many people, myself included. Copilot, a tool trained on public code that is designed to auto-suggest code to programmers, has been greeted by excitement, curiosity, skepticism, and concern since it was announced.

The GitHub Copilot investigation site’s arguments build on previous work by Butterick, as well as thoughtful analysis by Bradley M. Kuhn at the Software Freedom Conservancy. I find the arguments contained in these pieces convincing in some places and not as convincing in others, so I’m writing this post in the hopes that it helps me begin to sort it all out.

At this point, Copilot strikes me as a tool that replaces googling for stack overflow answers. That seems like something that could be useful. It also seems plausible that training such a tool on open public software repositories (including open source repositories) could be allowed under US copyright law. That may change if or when Copilot evolves, which makes this discussion a fruitful one to be having right now.

Both Butterick and Kuhn combine legal and social/cultural arguments in their pieces. This blog post starts with the social/cultural arguments because they are more interesting right now, and may impact the legal analysis as facts evolve in the future. Butterick and Kuhn make related arguments, so I’ll do my best to be clear which specific version of a point I’m engaging with at any given time. As will probably become clear, I generally find Kuhn’s approach and framing more insightful (which isn’t to say that Butterick’s lacks insight!).

What is Copilot, Really?

A large part of this discussion seems to turn on the best way to think about and analogize what Copilot is doing (the actual Copilot page does a pretty good job of illustrating how one might use it).

Butterick seems to think that the correct way to think about Copilot is as a search engine that points users to a specific part of a specific (often open source) software package. In his words, it is “a convenient alternative interface to a large corpus of open-source code”. He worries that this “selfish interface to open-source software” is built around “just give me what I want!” (emphasis his).

The selfish approach may deliver users to what they think they want, but in doing so hides the community that exists around the software and removes critical information that the code is licensed under an open source license that comes with obligations. If I understand the argument correctly, over time this act of hiding the community will drain open source software of its vitality. That makes Copilot a threat to open source software as a sustainable concept.


The concern about hiding open source software’s community resonates with me. At the same time, Butterick’s starting point strikes me as off, at least in terms of how I search for answers to coding questions.

This is probably a good place to pause and note that I am a Very Bad coder who, nonetheless, does create some code that tends to be openly licensed and is just about always built on other open source code. However, I have nowhere near the skills required to make a meaningful technical contribution to someone else’s code.

Today, my “convenient alternative interface” to finding answers when I need to solve coding problems is google. When I run into a coding problem, I either describe what I am trying to do or just paste the error message I’m getting into google. If I’m lucky, google will then point me to stack overflow, or a blog post, or documentation pages, or something similar. I don’t think that I have ever answered a coding question by ending up in a specific portion of open source code in a public repo. If I did, it seems unlikely that code – even if it had great comments – would get me where I was going on its own because I would not have the context required to quickly understand that it answered my question.

This distinction between “take me to part of open source code” (Butterick’s view) and “help me do this one thing” (my view) is important because when I look at the Copilot website, it feels like Copilot is currently marketed as a potentially useful stack overflow commenter, not someone with an encyclopedic knowledge of where that problem was solved in other open source code. Butterick experimented with Copilot in June and described the output as “This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today.” That’s right at my level!

If you ask Copilot a question like “how can I parse this list and return a different kind of list?,” in most cases (but, as Butterick points out, not all!) it seems to respond with an answer synthesized from many different public code repositories instead of just pointing to a single “best answer” repo. That makes Copilot more of a stack overflow explorer than a public code explorer, albeit one that is itself trained by exploring public code. That feels like it reduces the type of harm that Butterick describes.
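To make that kind of question concrete, here is a sketch of the sort of small, generic snippet a “parse this list and return a different kind of list” prompt is asking for. It is a hand-written illustration, not actual Copilot output, and the function name `parse_int_list` is hypothetical.

```python
def parse_int_list(raw_values):
    """Turn a list of strings into a list of integers.

    Entries that cannot be read as integers are skipped rather than
    raising an error. This is an illustrative helper, not Copilot output.
    """
    parsed = []
    for value in raw_values:
        try:
            # strip() tolerates stray whitespace around the digits
            parsed.append(int(value.strip()))
        except ValueError:
            # Not a valid integer, so leave it out of the result
            continue
    return parsed


print(parse_int_list(["1", " 2 ", "three", "4"]))  # prints [1, 2, 4]
```

Nothing about a snippet like this points back to a particular repository, which is why it reads more like a stack overflow answer than like a search result.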

Use at Your Own Risk

Butterick and Kuhn also raise concerns about the fact that Copilot does not make any guarantees about the quality of code it suggests. Although this is a reasonable concern to have, it does not strike me as particularly unique to Copilot. Expecting Copilot to provide license-cleared and working code every time is benchmarking it against an unrealistic status quo.

While useful, the code snippets I find in stack overflow/blog post/whatever are rarely properly licensed and are always “use at your own risk” (to the extent that they even work). Butterick and Kuhn’s concerns in this area feel equally applicable to most of my stack overflow/blog post answers. Copilot’s documentation is fairly explicit about the value of the code it suggests (“We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself.”), for whatever that is worth.

Will Copilot Create One Less Reason to Interact Directly with Open Source Code?

In Butterick’s view, another downside of this “just give me what I want” service is that it reduces the number of situations where someone might knowingly interact with open source code directly. How often do most users interact directly with open source code? As noted above, I interact with a lot of other people’s open source software as an extremely grateful user and importer of libraries, but not as a contributor. So Copilot would shift my direct deep interaction with open source code from zero to zero.

Am I an outlier? Nadia Asparouhova (née Eghbal)’s excellent book Working in Public provides insight into open source software grounded in user behavior on GitHub. In it, she tracks how most users of open source software are not part of the software’s active developer community:

“This distribution – where one or a few developers do most of the work, followed by a long tail of casual contributors, and many more passive users – is now the norm, not the exception, in open source.”

She also suggests that there may be too much community around some open source software projects, which is interesting to consider in light of Butterick’s concern about community depletion:

“The problem facing maintainers today is not how to get more contributors but how to manage a high volume of frequent, low-touch interactions. These developers aren’t building communities; they’re directing air traffic.”

That suggests that I am not necessarily an outlier. But maybe users like me don’t really matter in the grand scheme of open source software development. If Butterick is correct about Copilot’s impact on more active open source software developers, that could be a big problem.

Furthermore, even if users like me are representative today, and Copilot is not currently good enough to pull people away from interacting with open source code, might it be in the future?

“Maybe?” feels like the only reasonable answer to that question. As Kuhn points out, “AI is usually slow-moving, and produces incremental change far more often than it produces radical change.” Kuhn rightly argues that slow-moving change is not a reason to ignore a possible future threat. At the same time, it does present the possibility that a much better Copilot might itself be operating in an environment that has been subject to other radical changes. These changes might enhance or reduce that future Copilot’s negative impacts.

Where does that leave us? The kind of casual interaction with open source code that Butterick is concerned about may happen less than one might expect. At the same time, today’s Copilot does not feel like a replacement for someone who wants to take a deeper dive into a specific piece of open source software. A different version of Copilot might, but it is hard to imagine the other things that might be different in the event that version existed. Today’s version of Copilot does not feel like it quite manifests the threat described by Butterick.

Copilot is Trained on Open Source, Not Trained on Open Source

For some reason, I went into this research thinking that Copilot had explicitly been trained on open source software. That’s not quite right. Copilot was trained on public GitHub repositories. Those include many repositories of open source software. They also include many repositories of code that is just public, with no license, or a non-open license, or something else. So Copilot was trained on open source software in the sense that its training data includes a great deal of open source software. It was not trained on open source software in the sense that its training data only consists of open source software, or that its developers specifically sought out open source software as training data.

This distinction also happens to highlight an evolving trend in the open source world, where creators conflate public code with openly licensed code. As Asparouhova notes:

“But the GitHub generation of open source developers doesn’t see it that way, because they prioritize convenience over freedom (unlike free software advocates) or openness (unlike early open source advocates). Members of this generation aren’t aware of, nor do they really care about, the distinction between free and open source software. Neither are they fired up about evangelizing the idea of open source itself. They just publish their code on GitHub because, as with any other form of online content today, sharing is the default.”

As a lawyer who works with open source, I think the distinction between “openly/freely licensed” and “public” matters a lot. However, it may not be particularly important to people using publicly available software (regardless of the license) to get deeper into coding. While this may be a problem that is exacerbated by Copilot, I don’t know that Copilot fundamentally alters the underlying dynamics that feed it.

As noted at the top, and attested to by the body of this post so far, this post starts with the cultural and social critiques of Copilot because that is a richer area for exploration at this stage in the game. Nonetheless, the critiques are – quite reasonably – grounded in legal concerns.

Fair Use

The legal concerns are mostly about copyright and fair use. Normally, in order to make copies of software, you need permission from the creator. Open source software licenses grant those permissions in return for complying with specific obligations, like crediting the original creator.

However, if the copy being made of the software is protected by fair use, the copier does not need permission from the creator and can ignore any obligations in a license. In this case, GitHub is not complying with any open source licensing requirements because it believes that its copies are protected by fair use. Since it does not need permission, it does not need to comply with license requirements (although sometimes there are good reasons to comply with the social intent of licenses even if they are not legally binding…). It has said as much, although it (and its parent company Microsoft) has declined to elaborate further.

I read Butterick as implying that GitHub and Microsoft’s silence on the details of its fair use claim means that the claim itself is weak: “Why couldn’t Microsoft produce any legal authority for its position? Because [Kuhn and the Software Freedom Conservancy] is correct: there isn’t any.”

I don’t think that characterization is fair. Even if they believe that their claim is strong, GitHub cannot assume that it is so strong as to avoid litigation over the issue (see, e.g. the existence of the GitHub Copilot investigation website itself). They have every reason to avoid pre-litigating the fair use issue via blog post and press release, keeping their powder dry until real litigation.

Kuhn has a more nuanced (and correct, as far as I’m concerned) take on how to interpret the questions: “In fact, these areas are so substantially novel that almost every issue has no definitive answers”. While it is totally reasonable to push back on any claims that the law around this question is settled in GitHub’s favor (Kuhn, again, “We should simply ignore GitHub’s risible claim that the “fair use question” on machine learning is settled.”), that is very different than suggesting that it is settled against GitHub.

How will this all shake out? It’s hard to say. Google scanned all the books in order to create search and analytics tools, claiming that their copies were protected by fair use. They were sued by The Authors Guild in the Second Circuit. Google won that case. Is scanning books to create search and analytics tools the same as scanning code to create AI-powered autocomplete? In some ways yes? In other ways no?

Google also won a case before the Supreme Court where they relied on fair use to copy API calls. But TVEyes lost a case where they attempted to rely on fair use in recording all television broadcasts in order to make it easy to find and provide clips. And the Supreme Court is currently considering a case involving Warhol paintings of Prince that could change fair use in unexpected ways. As Kuhn noted, we’re in a place of novel questions with no definitive answers.

What About the ToS?

As Franklin Graves pointed out, it’s also possible that GitHub’s Terms of Service allow it to use anything in any repo to build Copilot without worrying about additional copyright permissions. If that’s the case, they won’t even need to get to the fair use part of the argument. Of course, there are probably good reasons that GitHub is not working hard to publicize the fact that their ToS might give them lots of room when it comes to making use of user uploads to the site.

Where Does That Leave Things?

To start with, I think it is responsible for advocates to get out ahead of things like this. As Kuhn points out:

“As such, we should not overestimate the likelihood that these new systems will both accelerate proprietary software development, while we simultaneously fail to prevent copylefted software from enabling that activity. The former may not come to pass, so we should not unduly fret about the latter, lest we misdirect resources. In short, AI is usually slow-moving, and produces incremental change far more often than it produces radical change. The problem is thus not imminent nor the damage irreversible. However, we must respond deliberately with all due celerity — and begin that work immediately.”

At the same time, I’m not convinced that Copilot is a problem. Is it possible that a future version of Copilot would starve open source software of its community, or allow people to effectively rebuild open source code outside of the scope of the original license? It is, but it seems like that version of Copilot would be meaningfully different from the current version in ways that feel hard to anticipate. Today’s Copilot feels more like a fast lane to possibly-useful stack overflow answers than an index that can provide unattributed snippets of all open source software.

As it is, the acute threat Copilot presents to open source software today feels relatively modest. And the benefits could be real. There are uses of today’s Copilot that could make it easier for more people to get into coding – even open source coding. Sometimes the answer of a talented 12 year old is exactly what you need to get over the hump.

Of course, GitHub can be right about fair use AND Copilot can be useful AND it would still be quite reasonable to conclude that you want to pull your code from GitHub. That’s true even if, as Butterick points out, GitHub being right about fair use means that code anywhere on the internet could be included in future versions of Copilot.

I’m glad that the Software Freedom Conservancy is getting out ahead of this and taking the time to be thoughtful about what it means. I’m also curious to see if Butterick ends up challenging things in a way that directly tests the fair use questions.

Finally, this entire discussion may also end up being a good example of why copyright is not the best tool to use against concerns about ML dataset building. Looking to copyright for solutions has the potential to stretch copyright law in strange directions, cause unexpected side effects, and misaddress the thing you really care about. That is something that I am always wary of, and a prior that informs my analysis here. Of course, Amanda Levendowski makes precisely the opposite argument in her article Resisting Face Surveillance with Copyright Law.

Michael Weinberg is the Executive Director of NYU’s Engelberg Center for Innovation Law and Policy and Board President of Open Source Hardware Association. This article is reposted with permission from Michael Weinberg’s blog.

Companies: github, microsoft


Comments on “If GitHub Copilot Is A Copyright Problem, Perhaps The Problem Is Copyright”

This comment has been deemed insightful by the community.
Anonymous Coward says:

Not a stackoverflow searcher

GitHub CoPilot definitely has the capability to generate novel solutions. While there are reports that it does sometimes regurgitate verbatim code, I am not sure how frequent that problem is. Most notably, Microsoft added an option to scan the suggestions against the training set to prevent the AI from spitting out verbatim matches.

I would say that it is not terribly concerning at the moment. It’s fine to quickly google the code it produces to check and just do a bit of minimal due diligence.

Ehud Gavron (profile) says:

The "problem" of copyrights has a workaround.

Copyright issues are indeed aggravating, but there are developed mechanisms dealing with issues created by copyright. Fair-use in the US is one. Licenses that impose restrictions exist BECAUSE of copyright. No copyright, no “right to release under a license.”

The terms of the license, and I’ll speak to the GPL here, are that it allows use of the content, copying, including, re-using, etc. AS LONG AS the person doing so complies with the terms of the license.

When Copilot provides a verbatim copy of a project “piece” (be it a file, a commit, a branch, trunk, archive, etc.) and does so in violation of the license and is not fair-use, that is a problem.

Analogy: When I sign a non-disclosure agreement (NDA) the parties generally agree that non-public information disclosed pursuant to that agreement will not be disclosed, unless it is publicly disclosed –through no action of their own– and otherwise available.

Copilot effectively renders the “secret” material public, license-free, obligation-free. I can’t copy module matrix_rewrite_X from the repository without being subject to its license terms. These may be incompatible with other licenses in the same project.

If I word my query right and Copilot gives me that same module but absent any license restrictions, I’m not only free to use it, but free to provide it to others on my new website “UnlicensedPreviouslyLicensed.com” (UPL is also unlicensed practice of law, which –as I am not a lawyer– I’m not doing here.)

It’s great to blame MS or GH much like people blame Mr. Musk for something Tesla did or blame Mr. Bezos for screwing over ULA by having Blue Penis Origin focus on getting him into “space” before Mr. Branson instead of developing the BE-4 engine, but that’s not who or where the problem starts or ends, and blaming them is just misguided fingerpointing.

If MS-owned-GH wanted to take a proactive approach, they could request all [nonorphaned] projects to include a “COPYRIGHT.txt” and/or “LICENSE.txt” and have Copilot include that WHEN THE RESULT IS AN ENTIRE SECTION of the repository.
I don’t think that’s feasible for a variety of reasons, mostly because a) orphan works, b) why should X devs with Y projects and Z files do X*Z edits so that Copilot can avoid violating their rights, and c) the burden should be on Copilot.

Or, my favorite:
Copilot displays a disclaimer, links to the ORIGINAL GH files (as I believe it already does) and advises the recipient of the search result to comply with all agreements, regulations, licenses, etc. Is this stupid? Maybe, but so is the value of [I agree] in a court of law.


Anonymous Coward says:

Re: making an analogy with NDAs

Your analogy comes off as incomplete. I can see why Christenson was confused. You need to include an analogue to the output of a machine learning program on open source or libre software.

Here’s what I have in mind: Imagine that a very good security breaker has taken NDAed information from many companies and has trained a machine learning program on that information. Does the novel nature of the output of the ML program mean that the NDA obligations which apply to the input don’t apply to the output? With respect to the legality, I have no idea. With respect to the social acceptability, I would say that stuffing NDAed information into an ML program should not be permissible, unless the maker of the program asks each company for permission to release the output. Or alternatively, society could do away with the concept of NDAs. (I’m NOT saying that that’s a good idea!)

Similarly, I would argue that stuffing code licensed under GPLs (there are multiple versions) into an ML program doesn’t free Github from the license obligations, unless the maker explicitly releases all of the output under the latest version of the GPL. Or alternatively, society could get rid of the concept of copyright for software. (Again, whether that’s a good idea is a different question.) If Github can’t comply with this, then Copilot should not be allowed to exist. (Legally, Github can’t comply with the GPL’s obligations, because some of the code in the training set is proprietary despite being source-available.)

nasch (profile) says:

Re: Re:

Similarly, I would argue that stuffing code licensed under GPLs (there are multiple versions) into an ML program doesn’t free Github from the license obligations, unless the maker explicitly releases all of the output under the latest version of the GPL.

If I stuff my brain with a bunch of GPL code, do I then need to open source code that I write that was enabled by what I learned from that GPL code? Obviously not. Is the way a machine learning program operates different from a human in a legally significant way? If so, how?

Steven (profile) says:

Re: Re: Re: Not quite so clear cut

That’s not as clear cut as you make it sound. There are large projects that require ‘clean room’ standards for their contributors for that exact reason.

As an example, in the WINE project you are not allowed to contribute if you have ever seen windows source code. From their developer FAQ: “This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise).”

Clearly this is a corner case, and they are being overly cautious because of who they are dealing with, but this can be a legal grey area depending on the specifics.

Anonymous Coward says:

Re: Re: Re:

You raise a good point, but it’s not as obvious as you say. Replace “GPL code” with “information under NDA”, “trade secrets”, or “information about patented inventions”. If no one finds out, then you can’t get caught. But every once in a while the output will be so similar to an input that someone will find out. Who should be held accountable? Or should no one be punished at all? And suppose that Copilot were trained only on GPLed code. Legally, the output may be free from the GPL’s obligations. But in principle, wouldn’t you agree that this hypothetical version of Copilot breaks the social contract of the GPL, given that copylefted code is in a sense more free than public domain code? And it would be simple for Microsoft to exclude almost all GPLed code on Github by searching for licenses and the boilerplate that almost all GPLed code files include. A best effort shouldn’t be punished, after all. But Microsoft won’t even attempt to exclude GPLed code, will they?

Legally, Microsoft may turn out to be aboveboard. But there is also the issue of power. My concern is that Microsoft is deliberately breaking social contracts associated with copyleft licenses. Or more conservatively, Microsoft is being a jerk on purpose because Microsoft believes that authors of GPLed code can’t fight back in court. Microsoft would never put their own code into an ML program like Copilot, would they? It shouldn’t be a problem for Microsoft’s profits if the output isn’t similar to the input, right? There are no hard answers, but it’s not a stretch to conclude that Microsoft is acting in bad faith in ignoring copyleft license obligations.

TLDR: Legally, you may be correct. But socially, the answer isn’t as obvious as you make out. Microsoft might be spiting license obligations (especially copyleft) on purpose. Microsoft certainly wouldn’t include their own code in Copilot, and they won’t voluntarily make any attempt to exclude GPLed code from the training set.

stine says:

Re: the problem is you've ignored the fact that this is a computer

The problem is you’ve ignored the fact that this is a computer. That computer should have been programmed with the details of every available license for GH code elements. That program could then search for elegant solutions in ?all? code, but only provide matching code from repositories with appropriate licenses. The fact that it does not do this indicates to me that the authors of this tool did not consider the code’s license important.

nasch (profile) says:

Re: Re:

That program could then search for elegant solutions in ?all? code, but only provide matching code from repositories with appropriate licenses.

This seems to assume that Copilot is just spitting out verbatim code snippets in response to a search query. I have not used it, but my understanding is it suggests code tailored to what the user is doing, which is informed by its training set but not necessarily identical to anything found therein. So the question isn’t as simple as whether it’s allowed to copy a particular piece of code.

Anonymous Coward says:

How and Why are different questions

This post’s comment re context is everything for me. It is entirely possible that Copilot would give me a solution I could just plunk down into my code without problem or thought.

… but if I don’t know why the code works, it is useless to me, because I’d then have to go back to Copilot the next time as well. Stack Overflow most often includes the “why” to what I want to know.

I could, though, mention some of the horrible, horrible Medium posts I’ve waded through only to find that it didn’t come close to answering my question.

Christenson says:

Clarifying the source licenses..

It does not appear that copilot is using any code that you could not hunt down in the public part of github, so none of it is “secret”.

In addition to “no warranty”, licenses say different things.
GPL says “You must provide access to the source code” and “require users to do the same”.
Others require you to acknowledge your contributors.

Both of those are scholarly courtesies that Copilot could easily reinforce, but go against the grain of commercial product development — think Windows or your favorite Internet of Garbage toy.

To the extent copilot does not identify GPL material and its source, I think copilot violates the GPL. Not that I think Mike looking up the 10 lines of how to do something is not fair use; it’s the repetition of that act a million times by copilot that changes the nature of the copying.

Christenson says:

Re: An interesting license for nmap...

As part of a studying project, I encountered the license for nmap, which specifically limits the use of its otherwise open code as follows:

Even though Npcap source code is publicly available for review, it is
not open source software and may not be redistributed or used in other
software without special permission from the Nmap Project. The
standard (free) version is usually limited to installation on five
systems. We fund the Npcap project by selling two types of commercial
licenses to a special Npcap OEM edition:

That’s a use case Microsoft should have to agree to, or keep the code off of github…

Christenson says:

Re: Re: Re: yet another license twist

RFC 8446 (TLS 1.3) at https://datatracker.ietf.org/doc/html/rfc8446 has an interesting twist on licensing — Schrödinger’s permission.

This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.

naoEntendo (profile) says:

It's not the same, but it was bound to be tried eventually

While CoPilot is an interesting idea that was bound to be tried sooner or later, I believe it’s currently, especially in how it was/is trained, more like ClearView than StackOverflow.

ClearView scraped images from all over the web without the owners’ permission, or even awareness.

CoPilot scraped source code from GitHub without the owners’ permission, or even awareness.

Some will argue that buried in the tens of thousands of pages of legalese governing GitHub, legalese I might remind you Microsoft can alter at its unilateral whim anytime it wants to, gives Microsoft the right to use the source code as it sees fit. Others will argue that it’s fair use (isn’t that what ClearView is arguing, that the use of people’s images is fair use?). I think those arguments miss the point.

Microsoft should have been open and above board before they used other people’s code as grist for CoPilot’s mill. I believe that they weren’t because they feared that folks would start deleting their repositories en masse. What good is a source code repository without much source code? Just ask SourceForge.

Some folks may have pulled their repositories, other repositories are public domain, and still other authors might not have cared. Microsoft is counting on the fact that they are Microsoft with all of the clout and lawyers that entails to see them through.

Instead, the Software Freedom Conservancy is urging all open source projects to pull their source code off of GitHub. People are now wondering, “If Microsoft can do this with my source code, what’s next?” And the knowing greybeards in the back are saying, “See, I told you this would happen when Microsoft bought GitHub.”

I wouldn’t be surprised if the next round of open source licenses contains a new clause strictly prohibiting use of the source code in any ML data sets.

The author opined that CoPilot is a lot like Google’s scanning project, now Google Books. CoPilot (C) is different than Google Books (G) in many important ways.

G allows analysis to be performed on book contents, such as how often “he” shows up, or how many works reference “boats”. When G lets you browse a particular book, you know the book’s name, author, etc. G limits the amount of access based on the rights to the book. Public domain books are free to read/download in their entirety. Other books are limited to anywhere from a few lines to most pages, depending on the wishes of the copyright holder.

C, on the other hand, doesn’t let users perform analysis, although it does analyze software repositories. It doesn’t tell the user which work, author, or license the answer comes from. C doesn’t let you see the context; it just presents an answer, typically copied verbatim from some other author’s work. C doesn’t respect the wishes of the code’s author or license.

It’s a shame Microsoft decided to, well, be Microsoft. Hopefully the courts will decide fairly (yeah, I know, there’s not the best track record there) and when someone else goes down this path, they will do it with a lot more consideration.

Anonymous Coward says:


> Some will argue that the tens of thousands of pages of legalese governing GitHub (legalese that, I might remind you, Microsoft can alter at its unilateral whim anytime it wants to) give Microsoft the right to use the source code as it sees fit.

What happens when somebody uploads a large codebase like Linux to GitHub, without the permission of every copyright holder? Are such uploads against the terms of service?

Ehud Gavron (profile) says:

This isn't just limited to software...

Ars Technica reports that Shutterstock may have the same problem because they used their database of images to train an AI to provide… images.

| What happens when somebody uploads a large codebase like Linux to GitHub, without the permission of every copyright holder? Are such uploads against the terms of service?

Linux (or GNU/Linux if you prefer) has stayed on GPLv2 rather than moving to GPLv3, in part because relicensing would require EVERYONE who had contributed GPLv2-licensed code to agree to “GPLv3 [or greater]”. I don’t think “every copyright holder” in anything that large is likely to be
– reachable
– amenable
– able
– responsive within a set timeframe
for that to occur.


cls says:

Re: analogous to orphaned works

> I don’t think “every copyright holder” in anything that large is likely to be
– reachable
– amenable
– able
– responsive within a set timeframe
for that to occur.

Directly analogous to orphaned works in books, etc.

Could be solvable with shorter copyright duration and non-automatic renewal.

TKnarr (profile) says:

Stackoverflow vs. Copilot

I think the comparison to Stack Overflow answers misses an important point: when you take an SO answer you’re doing so with at least the implicit consent of the person who wrote the answer (they posted their answer in response to a question, so they have to at least consider that the answer will be used by the questioner). That’s a reasonable basis for you to assume the code there can be used (unless the respondent said otherwise). Copilot OTOH spits out code with no involvement from the authors of the code it was trained on, without them knowing their code would be used to provide the answer, and without any concern on Copilot’s part about what conditions might be attached to the code. You have no idea if the authors of the code would’ve consented to provide it in answer to a question if asked. You don’t even know that it’s not a copy of proprietary code. Just being on the public portion of GitHub doesn’t imply anything; books, after all, are publicly readable by anybody who picks one up, but that by itself doesn’t grant anyone any rights to take the text they read and use it elsewhere.

The problem isn’t copyright, or GitHub. The problem is that Copilot is presented as something it isn’t. It’s presented as “AI”, something capable of coming up with original work on its own, when what it is is simply a very complex program that takes a representation of data that it’s been given, transforms it in a deterministic fashion based on rules it has, and produces the resulting transformed material as output. Its data set is large beyond comprehension, its rules are so complex no human can understand them, and it can spot patterns in its data that no human would ever notice and add a rule describing them to its rule set, but it hasn’t yet passed the threshold where it can create original material (material that wasn’t provided to it as input or deterministically derived from its data according to rules in its rule set).

Steven (profile) says:

Not sure I buy the draining open source community argument.

I am a full-time software developer. More than 15 years in the industry.

I have on rare occasion contributed to open source software. I would be one of those ‘low-touch interactions’. I have often wanted to get more involved with open source projects, but I haven’t largely because of the time it would take.

I can’t think of a single time I’ve been drawn to an open source project because I searched for help putting together some small number of lines of code. While I understand there are some libraries that are very small I generally wouldn’t use one that I could replace with a few lines of code, barring some very specialized task.

I can’t think of a single interaction I would have with an open source project that Copilot would get in the way of.

Almost all my use of open source code is going to be in the form of a library or tool.

If I’m searching for some small code solution I’m not looking for open source code. I’m likely going to find the solution on a blog or something like stackoverflow.

If I’m searching for functionality I think a library/tool would be the right thing for, I’m looking for a library/tool and wouldn’t be in a situation to use Copilot anyway.

If I’m using a library/tool and have an issue I’m either going to google ‘issue + library name’ or go directly to the site for the library/tool.

I’d like to hear of a situation where Copilot would get in somebody’s way of interacting with open source. I just don’t see it.

pegr says:

The author’s legal analysis is weak

Developers miss a very important point about software. Not all components of software are copyrightable. If someone lifts your sort routine, they have not violated your copyright.

Copyright applies only to creative expressions. Purely functional expressions are not subject to copyright.

Now if a search engine can generate code to do something mundane, how creative were all the original expressions the search engine indexed? If the expression is commonly used in many applications, I’m betting it’s purely functional in nature.

The argument doesn’t even need to mention “Fair Use”.

Anonymous Coward says:


> If someone lifts your sort routine, they have not violated your copyright.

This is overly simplistic, and only a court can really make that determination. Things like variable names and comments could be considered creative enough to be copyrightable.

> Copyright applies only to creative expressions. Purely functional expressions are not subject to copyright.

Prior to the late 1970s, all software was considered “purely functional” and was not subject to copyright. I still think that’s the most sensible view.

Kinetic Gothic says:

So I'm not alone..

I’m glad I’m not the only one who’s thought of the interaction between copyright and face recognition. When I first read about Clearview here, my first reaction was “they’re making a commercial product based on photos that I own the copyright to, they’re not much different from the people at Printerval who scraped my t-shirt graphics from Spreadshirt.” A C&D asking them to delete any works based on my feeds was certainly on my mind.

Anonymous Coward says:


I think some of the anger around Copilot comes from the fact that if an individual or small startup were behind it, attempting to train a machine-learning model on public source code from GitHub and other public sources, they would quite possibly be sued by many copyright stakeholders (possibly including Microsoft), and it would potentially be a difficult hole to get out of.
