Why Powerful But Hard To Detect Backdoors Could Become A Routine Problem For Open Source Projects Because Of AI
from the nebraska-problem-2.0 dept
Last year, Andres Freund, a Microsoft engineer, spotted a backdoor in xz Utils, an open source data compression utility that is found on nearly all versions of GNU/Linux and Unix-like operating systems. Ars Technica has a good report on the backdoor and its discovery, as well as a visualization by another Microsoft employee, Thomas Roccia, of what Ars calls “the nearly successful endeavor to spread a backdoor with a reach that would have dwarfed the SolarWinds event from 2020.” A post on Fastcode revisits the hack, and draws some important lessons from it regarding open source’s vulnerability to similar attacks and how the latest generation of AI tools make those attacks even harder to spot and guard against. It describes the backdoor’s technical sophistication as “breathtaking”:
Hidden across multiple stages, from modified build scripts that only activated under specific conditions to obfuscated binary payloads concealed in test files, the attack hijacked SSH authentication through an intricate chain of library dependencies. When triggered, it would grant the attacker complete remote access to any targeted system, bypassing all authentication and leaving no trace in logs.
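To give a flavor of the "only activated under specific conditions" stage, here is a simplified, hypothetical sketch in shell — not the actual xz payload, whose checks were buried in obfuscated m4/build machinery; the `should_activate` function and its two checks are invented for illustration:

```shell
#!/bin/sh
# Hypothetical sketch of a conditionally-activated build stage.
# The real xz payload gated itself on similar (but far more obfuscated)
# checks, so the malicious code only ran in targeted build environments.

# Decide whether the hidden stage should fire, given the platform
# (output of `uname -sm`) and the packaging environment ("deb"/"rpm"/"").
should_activate() {
  sys="$1"
  pkgenv="$2"
  # Only target x86-64 Linux builds...
  [ "$sys" = "Linux x86_64" ] || return 1
  # ...and only when building a distro package, so that a developer
  # compiling from a plain source checkout never sees the payload run.
  [ -n "$pkgenv" ] || return 1
  return 0
}

# A targeted distro build would trigger the hidden stage:
if should_activate "Linux x86_64" "deb"; then
  echo "targeted build: payload stage would run"
fi

# Everyone else sees a completely clean build:
if should_activate "Darwin arm64" ""; then
  :
else
  echo "non-target build: nothing unusual happens"
fi
```

The point of the gating is exactly what made the real thing so hard to spot: on any machine a reviewer was likely to test on, nothing malicious ever executed.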
Just as important as the technical skill involved was the level of social engineering deployed in a coordinated, planned fashion across years:
Enter “Jia Tan,” a developer persona created in January 2021, who would spend the next two years executing one of the most patient social engineering campaigns ever documented. Beginning with small, helpful contributions in late 2021, Jia Tan established credibility through hundreds of legitimate patches across multiple projects. This wasn’t a rushed operation: the attackers invested years building an authentic-looking open source contributor profile.
But Jia Tan didn’t work alone. Starting in April 2022, a coordinated network of sockpuppet accounts began pressuring Collin [the xz Utils maintainer]. “Jigar Kumar” complained about patches languishing for years, declaring “progress will not happen until there is new maintainer.”
This is a familiar issue in the open source world, sometimes called the “Nebraska problem” (pdf) after a famous xkcd cartoon that showed diagrammatically “all modern digital infrastructure” held up by “a project some random person in Nebraska has been thanklessly maintaining since 2003”. Those behind the xz Utils hack exploited the fact that it depended on one person who was struggling to keep the project going as an unpaid hobby, and without adequate support. Once “Jia Tan” had established credibility through hundreds of useful patches, sockpuppets pushed for the existing xz Utils maintainer to grant almost complete control to this willing and apparently skilled helper, including commit access, release privileges, and even ownership of the project website. With that power, the backdoor could be deployed, as outlined in the Ars Technica article.
The Fastcode post points out that however bad things were previously in terms of vulnerability to sophisticated social engineering hacks of the kind employed for the xz Utils backdoor, today the situation is far worse because of the new large language models (LLMs):
The xz attack required years of patient work to build Jia Tan’s credibility through hundreds of legitimate patches. These [LLM] tools can now generate those patches automatically, creating convincing contribution histories across multiple projects at once. Language models can craft personalized harassment campaigns that adapt to each maintainer’s specific vulnerabilities, psychological profile, and communication patterns. The same tools that help developers write better code are also capable of creating more sophisticated backdoors. They can produce better social engineering scripts. Additionally, these tools can generate more convincing fake identities.
The timeline compression is terrifying. What took the xz attackers three years of careful reputation building, LLMs can compress into months or even weeks. Multiple attack campaigns can run in parallel, targeting dozens of critical projects at the same time. Each attack learns from the others, refining its approach based on what works. The sockpuppet accounts that pressured Collin were crude compared to what’s now possible. AI-driven personas can keep consistent backstories and engage in technical discussions. They can also build relationships over time, all while being generated and managed at scale.
The current exploitation of open source coders’ goodwill already endangers the whole of modern digital infrastructure because of the Nebraska problem, but now: “We’re asking people who donate their evenings and weekends to defend against nation-state actors armed with the most sophisticated AI tools available. This isn’t just unfair; it’s impossible.”
There is only one solution that stands any chance of being effective: massively bolstering the support that open source maintainers receive. They need to be properly financed so that they can build broad teams with the human and technical resources to spot and fight the LLM-powered attacks that are coming. The sums required are trivial compared to the trillions of dollars of value created by open source software, selfishly used without payment by governments and companies alike. They are also tiny compared to the losses those same governments and companies around the world would incur if such LLM attacks succeeded in subverting key software elements. What’s frustrating is that this problem has been raised time and time again, and yet little has been done to address it. The xz Utils hack should be the digital world’s final wake-up call to tackle this core vulnerability of the open source world before it is too late.
Follow me @glynmoody on Mastodon and on Bluesky.
Filed Under: ai, andres freund, backdoor, gnu, goodwill, lasse collin, linux, llms, maintainer, microsoft, nebraska problem, patches, social engineering, sockpuppets, ssh, unix, xkcd
Companies: microsoft, solarwinds


Comments on “Why Powerful But Hard To Detect Backdoors Could Become A Routine Problem For Open Source Projects Because Of AI”
Not convinced
The commenters on the lwn.net website don’t think much of this Fastcode story. They’re certainly not panicking, as you’d expect if they accepted its reasoning; several there call it clickbait.
Re:
Yes, maintainers of many major open source projects, like curl, have disputed the claim that AI can produce decent complex patches, and any maintainer could spot these as AI-generated pretty easily, even when pressured to integrate them.
Sure, AI can produce convincing messages that could harass a maintainer enough to push them into errors of judgment. But once you know what AI is actually capable of (and most open source devs know that’s much less than what AI companies are selling) and can recognize common patterns in AI-generated content, AI is clearly more effective at wasting maintainers’ time than at planting backdoors in code “with many eyes” reviewing it.
Is this article written by an AI infiltration team?
Wait, the only solution to sophisticated state actors infiltrating open source projects is to make sure that people with lots of money (ie: state actors) are able to finance and support those projects (possibly including actually building the teams working on it), with the added leverage that you now depend on them for financial support?
I’d almost think the article was written by one of these AI infiltrators.
Re:
What the fuck, state actors? No, just give some funding to open source projects. Not state control, not corporate control.
Then again, not sure why I am bothering here, as I rather suspect some new-looking ACs here are some combination of: LLMs, state actors or their toadies, corporate actors, bad faith actors, trolls, or people with some sort of agenda and/or ideology.
Okay, that’s not a bad idea, but I’m not sure it’s the “only” thing that could work. I certainly expect it would help to stop treating it as normal to have to deal with tens of thousands of lines of unverifiable code to do fuck-all.
Configure scripts were always an obvious risk. 50,000 lines of auto-generated shit to deal with every buggy platform that hasn’t been seen for 30 years. “Checking for stdint.h…” (standard since 1999!), “Checking for a Xenix pre-ANSI C compiler…”, stuff like that; hell, I’m surprised it’s not checking for 6-bit bytes and PDP-endianness.
Hey, don’t even worry, just pipe the thing directly to your shell from wget (itself built via a 74,300-line configure script). Then the script can go back out to the web, to fetch dozens of often-trivial dependencies from sites that may or may not still exist.
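For anyone who has never peeked inside one of these scripts: each “Checking for…” line boils down to compiling a throwaway test program and seeing if it works. A stripped-down sketch of the idea — the `check_header` function here is invented for illustration; real autoconf output is far more convoluted:

```shell
#!/bin/sh
# Minimal sketch of an autoconf-style feature probe: write a tiny
# program that includes the header, try to compile it, report yes/no.
check_header() {
  hdr="$1"
  cat > conftest.c <<EOF
#include <$hdr>
int main(void) { return 0; }
EOF
  if ${CC:-cc} -c conftest.c -o conftest.o 2>/dev/null; then
    result=yes
  else
    result=no
  fi
  rm -f conftest.c conftest.o
  echo "checking for $hdr... $result"
}

# stdint.h has been standard since C99, yet scripts still probe for it:
check_header stdint.h
```

Multiply that by a few thousand mostly obsolete probes and you get the 50,000-line scripts being complained about — any one of which is a fine place to hide a conditional payload, since nobody reads them.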
Why would you limit this to open source projects?
It could just as easily and much more undetectably be added to closed source/proprietary projects.
Re:
Sure, like the Juniper backdoor. But it seems somewhat redundant to say that software we’re prohibited to know anything about might have behavior we don’t know about. People who run such software have basically already given up (unless maybe it’s in a very restrictive sandbox or its behavior is otherwise limited or characterized).
Re:
It probably couldn’t because those projects don’t accept code from external sources in the first place and won’t give anyone access who doesn’t work for the company.
Proprietary code’s vulnerable to an even simpler attack: bogus employees. We’ve already seen cases of this. Don’t bother with lengthy prep, just get your guy hired as a contract employee at the target. If they aren’t given the necessary access immediately it won’t take more than 6 months to a year to get it thanks to high churn and overwork.
Re: Re:
That’s an excellent point. Insider attack is a (relatively) cheap, easy, powerful tactic. That’s why intelligence agencies use it on each other.
And it’s going to get worse as companies jettison human programmers and deploy AI, because someone is going to — if they haven’t already — train an AI to execute a long-game insider attack against code. Yeah, it probably won’t be able to handle the human factors part of this as well as the people behind the xz attack, but: if its “coworkers” are also AI, it won’t have to.
Not that this will slow down the sociopaths running AI companies: they’re willing to wreck the Internet, the power grid, the environment, everything so that they can keep feeding their egos and greed. They simply don’t care who or what they hurt or how badly. And all the people who are just trying to create things, from programmers to artists, are going to suffer as a result.
Re: Re: Re:
That seems a lot harder than training the system to find existing bugs. No “inside access” required, and it might even be better to ignore the source code—the binary code could act in subtle ways not visible from source.
The same thing could be used for defense, eventually. But the attackers are always much more motivated, so it’ll take a while to “trickle down”.
Re: Re: Re:2
“That seems a lot harder than training the system to find existing bugs.”
Maybe. But I’m not so sure about that. Attackers have first-mover advantage, and an AI attacker (or an AI-assisted human attacker) can take as much time as they want to analyze code as thoroughly as they want before launching an attack… whereas a defender needs to detect it before it takes effect.
“… it might even be better to ignore the source code—the binary code could act in subtle ways not visible from source.”
That’s where Ken Thompson’s famous Turing Award lecture, “Reflections on Trusting Trust” comes in — he did exactly that, 40+ years ago…and none of us caught him, despite having far fewer pieces of code to examine and despite lavishing attention on pretty much all of them. It’s getting harder to pull this off today because of verified builds and cryptographic checksums and so on — but it’s not impossible.
Re: Re: Re:3
“Trusting Trust” is a cool paper, but I was thinking more about stuff like when the Linux developers noticed the compiler was deleting their NULL-pointer checks (because the checks had been written incorrectly, although they didn’t seem to like that explanation). Also actual and mundane compiler bugs—although who’s to say that any one of these wasn’t an intentional attack? The Thompson attack was not a subtle miscompilation.
The correctness proof for the seL4 kernel is based on the binary code, for reasons such as these.
Re: Re:
Ha, ha. Proprietary projects often have a ton of code from external sources. Check the manuals and copyright notices sometime. Your home internet router probably includes a copy of the GNU General Public License and an offer for source code. Microsoft and Apple will give you many instances of “Redistributions in binary form must reproduce the above copyright notice…”.
Why write and debug a red-black tree when Niels Provos has already done it? I’m not sure we even know who wrote sys/queue.h, the widely-used linked-list code; we just know that it comes from 4.4BSD and The Regents of the University of California hold the copyright. It would actually be weird to find a major proprietary project that didn’t include code found online; they just don’t always advertise it, especially when not required to.
Okay, it might be a little harder to execute an evil plan to get vulnerable code into proprietary software. You’ve gotta make something vaguely useful, get it onto GitHub or into a BSD or Linux distribution, and then wait and hope. But you’re kidding yourself if you think the proprietary software developers are carefully scrutinizing the code they pull in (although their lawyers may be scrutinizing the copyright notices).
Re: Re: Re:
Oh yeah. This 100%. Watching the industrial and utility sector absolutely lose its shit when utilities and agencies mandated SBOMs for 20-year-old products still in production was quite a sight.
Re: Re: Re:
Yes, they use external software, but compromising that software is an attack on that software, not the proprietary software itself. The company almost certainly doesn’t modify the external software, so push access to the company’s repos won’t let you introduce vulnerabilities in the external software.
Creating a trojan-horse package and getting it used is certainly an attack, and we’ve seen plenty of those (see “supply-chain attack”), but it doesn’t involve AI nor will AI make those attacks easier.
Re: Re: Re:2
Not necessarily, although splitting this hair further won’t do much to comfort anyone affected by such an attack. It could be said that the xz backdoor wasn’t an attack on xz; it didn’t, after all, introduce any exploitable vulnerability into xz. Rather, the attack was on OpenSSH, via xz (liblzma), without anyone involving OpenSSH’s repositories, build servers, or developers.
Remember when SSH was proprietary? (For those who are too young: that was from its release in 1995, until sometime in 1999 when OpenSSH was released; OpenSSH quickly became the dominant implementation.) The xz attack could’ve gone exactly the same way, had the proprietary SSH been linking against liblzma. For all we know, similar backdoors could be in effect right now in some proprietary software. Who’d know? Users of such software generally expect it to be inscrutable, so might be quite a bit less likely to notice “odd” things like high CPU usage (hey, it’s gotta fetch ads, transmit telemetry, check for cheating, mine Bitcoin…).
Maybe, maybe not. “A.I.” is just marketing bullshit right now, and no actual A.I. exists (as far as we know). The things currently branded as “A.I.” could potentially make them easier in various ways. I consider the premise to be semi-plausible science fiction, rather than pure fantasy.
Re: Re:
Right. It’s not like pro-authoritarian companies would ever want to add such backdoors to their closed source code themselves, is it? Fucking dipshit.
Re:
Closed-source is… closed. And self-funded.
Should I donate more to Microsoft, or what the fuck?
Too, too late, soz.
No one is going to come in and start financing open source development, there’s no way to monetise your investment if the product is given away.
It was always stupid, giving away your time and effort to produce something for the “community” whilst the rest of us were trying to make a living.
You devalue the work of all developers by giving it away.
Re:
Indeed: How DARE those of us contributing to FLOSS projects take our time and energy, do something we enjoy, and then allow it to enrich the rest of humanity as well. /s
A funny story:
As a software developer who has worked on both proprietary code, and FLOSS projects, I have encountered an endless parade of terrible code. This is not restricted to either of those two groups.
However, the very best code I have found is always in FLOSS projects.
Re:
Your work has no value.
Be seeing you.
Re:
Fuck you, Mr. Moneybags. FYI, you devalue the reputation of all developers with your entitled attitude that if someone’s too poor to pay for software, then they don’t deserve to have it.
Re:
You might want to get rid of everything you use that has a chip in it, they likely all have some open source software in them.
Re:
Oh, I quite agree. Yes. You’re definitely right. Let me help you out as best I can.
First, you’re going to have to stop using web sites like this one. A heck of a lot of those are run by open source projects like Apache and Nginx, and a lot of them use open source JavaScript frameworks.
Wait, wait! Don’t leave yet. There’s more.
Most of the Internet’s email moves around thanks to sendmail, postfix, exim, and courier — all open source projects. So you’ll have to abandon that.
Also, DNS is handled in large part by BIND and unbound — also open source. So you’re going to have to get used to typing IP addresses. Sorry.
But that won’t matter much because most of the serious computing on the Internet happens on Unix and Linux systems — also open source. You’ll definitely want to avoid using any of those.
You see, those of us who built this network — and after 45 years in the field, I can safely count myself as one of those people — not only invented open source, we used it to build the most successful and largest project in the history of computing. Open source — like The Force — surrounds and pervades the Internet. So if you really hate it that much, you’re going to have to go offline.
Bye now.
Re:
You’re another fucking moron in today’s comments, got it.
Re:
I can’t tell if this is sarcasm or not.
WYD Moody? Who’s paying for the open source hit piece?
Re:
Ooh, bullshit from a different direction! It must be Flood The Zone With Shit Day.
xz Utils is a good cautionary tale. What does that have to do with AI?
Re:
“A.I.” could pretend to be many non-existent people, sending innocuous patches and otherwise participating in mailing lists to establish credibility; and then a bad person could use the same identity to sneak some undesirable code in.
Maybe. In theory. It sounds like a “movie-plot threat” to me. The people doing this have budgets to hire actual stooges to do the same, and could’ve been doing it for decades. There was “The Linux Backdoor Attempt of 2003”; it didn’t use reputation in this way, but it’d be reasonable to think that such cons started around the same time.
Re:
Tell us more about how you didn’t read the article.
It’s something to watch out for but I think it greatly overestimates LLMs’ abilities to contribute quality code and pass for human for extended periods of time.
“Jia Tan” orchestrated a con over a period of years. LLM chatbots can’t maintain a consistent reality from one answer to the next.
This isn’t to say LLMs aren’t going to put real strain on the Nebraska Problem; they are. But it’s mostly going to be because they’re contributing a glut of crappy, poorly-explained code that taxes project maintainers’ time and energy.
Re:
Humans are pretty famously inconsistent too, particularly over multi-year periods, so I don’t think that’s gonna be what gives it away. Maybe the sycophancy will. Or just that their text and patches will be kind of nonsensical (then again, the messages posted here by “ECA” are rarely comprehensible—”LINKS into Pictures”?—and I assume they’re from a human).
How many Backdoors in Windows,
and the Current HTML Which Will let you insert LINKS into Pictures.
All “AI” models are also crazy susceptible to manipulation and infection themselves.
Time to dump that dumpsterfire.
Nice bullshit.
The first place where AI code is being used is at Google and Microsoft.
Didn’t take long to see Google get fucked real nice in the ass with security holes. And Microsoft ? Bwahahahaha.
Those will burn first and hard, long before an open source project gets anywhere close to this problem.
Perhaps you should do your fucking homework and check the news a little.
Re:
And yet Android devices can still hook up to cheap third-party Bluetooth headphones, because Alphabet didn’t feel that shutting down interconnectivity was a lot easier than fixing its security. How do you like them Apples?