Concerns about AI code in the kernel

From: Ellie

Date: Wed May 13 2026 - 04:24:20 EST

Dear Linux Kernel Mailing list,

My deepest apologies for taking up your time. Greg suggested at some point that I send this to the LKML. (He meant it as a response to a specific patch, but I don't think I was subscribed in time to reply to it, so sorry for sending it as a standalone message instead.)

I'm concerned about the Linux Foundation's and the Linux kernel's LLM use, specifically the use of so-called "generative AI" output in the actual code of the Linux kernel. For what it's worth, I'm not much concerned about using LLMs to point out bugs. And I'm not a kernel contributor, so I apologize again for bringing this up at all.

But let me explain why:

First, I'm not a lawyer, just a FOSS contributor concerned with the future of the ecosystem and the Linux kernel. But I think the GPL matters and that attribution matters to keep FOSS alive and healthy, which is a concern beyond pure legal questions.

Through public debate, the plagiarism aspect of LLMs with regard to their training data seems to be widely known at this point: specifically, that using LLM code may mean that random snippets originate from whatever the model was trained on, without proper attribution or licensing being retained.

However, it seems like the amount of plagiarism may be underestimated:

https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567 This video clip by EU lawyer Chan-jo Jun was the one that made me the most concerned. He reviews Copilot, and the second example in the clip especially seems like a quite direct copy of a specific source, without much of a prompt that I would have expected to cause it to plagiarize.

https://dl.acm.org/doi/10.1145/3543507.3583199 As far as I can tell, this field study suggests that even when the model is not baited to plagiarize, there is a regular level of plagiarism in the output of at least 2-5%.

https://lcamtuf.substack.com/p/large-language-models-and-plagiarism This case study appears to show how even punctuation and other very basic elements can apparently end up plagiarized, even if the model wasn't prompted and baited into completing a known text.

https://www.pcgamer.com/software/ai/microsoft-uses-plagiarized-ai-slop-flowchart-to-explain-how-github-works-removes-it-after-original-creator-calls-it-out-careless-blatantly-amateuristic-and-lacking-any-ambition-to-put-it-gently/ This seems to be a high-profile example, run into by Microsoft no less, of somebody using gen AI to create something from scratch without intention to plagiarize, where the end result was an apparently significantly plagiarized copy based on a single creator.

https://openreview.net/forum?id=TatRHT_1cK This scientific paper seems to suggest that not only is "memorization", that is, reproduction of the training material, inherent, but that it also grows with model size and therefore might be an escalating problem: "On the whole, we find that memorization in LLMs is more prevalent than previously believed and will likely get worse as models continues to scale, at least without active mitigations."

https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/ This article and examination seems to suggest that even with the latest models, the mitigations to prevent them from directly plagiarizing larger sources are insufficient.

https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7 This study seems to suggest that not only are memorization and plagiarism inherent, but that they are required to make a high-performing model: "In this work we explored the relationship between discourse quality and memorization for LLMs. We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate." I think this raises direct questions about the often-praised Claude Code, and whether it plagiarizes even more than previous models.

https://www.twobirds.com/en/insights/2025/landmark-ruling-of-the-munich-regional-court-(gema-v-openai)-on-copyright-and-ai-training This court ruling seems to suggest that there is no clarity yet on whether AI output will be regarded as transformative enough, so the assumption that it will not be regarded as derivative of the training data seems potentially untested.

Another thing that worries me is that LLM code can apparently lead to a higher rate of hidden bugs: https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report There are at least some studies suggesting that this happens even when a human reviewer is in the loop.

Anyway, I understand it's perhaps weird to point to studies when you're the ones doing the actual Linux kernel work for first-hand experience.

I'm just seeing all these incidents play out, and it concerns me that perhaps the risks of AI are being underestimated here. I assume that once the Linux kernel contains a lot of "LLM code", it will be hard to undo after new work has been built on top of it.

There are also articles exploring the moral side beyond just legal concerns: https://writings.hongminhee.org/2026/03/legal-vs-legitimate/ I think it's legitimate to ask if this will mean the end of any FOSS licenses meaning anything, especially the GPL, if the ecosystem as a whole decides to go down this route. It makes me wonder what this will mean for the willingness of volunteers to still contribute, if not even basic attribution is going to be given.

I'm also concerned that some Linux kernel developers have apparently argued in the past that an LLM code ban is useless since it's not enforceable, but the same seems to be true of, e.g., manually copying proprietary code from other sources without attribution and integrating it into the Linux kernel. I don't get why LLM code shouldn't be banned anyway if there are legitimate concerns, if only to shift the responsibility from the project to the violating contributor.

I'm also concerned about the common argument that while plagiarism may be common, the snippets would be considered too short to cause legal trouble. First, there seem to be high-profile incidents where a larger segment, rather than just small snippets, was unknowingly plagiarized through LLM output. Second, it ignores the concerns I mentioned above that extend to morals, the meaning of the GPL, and the future of FOSS, beyond the mere question of who can be sued because of it.

It also seems like if an LLM was made with properly licensed data, most of these concerns would go away. Therefore, not insisting on this basic requirement before using it for code contributions may enable unethical actors to ramp up the plagiarism, encouraging non-LLM contributors to no longer hold back from that either. The ecosystem should perhaps push for that aspect of LLMs to get better, rather than hope and pray it won't cause trouble in the future. If any FOSS player could hope to have any worthwhile pressure here, it would probably be the Linux Foundation.

Anyway, I hope some of this is useful to anybody here and to the Linux Foundation, and that you'll find some aspect of this actionable.

I apologize for the inconvenience and for bothering any of you, since I'm an outsider to this community. I hope my email can inspire useful discussions.

Regards,

Ellie