Re: LLM based rewrites

From: Theodore Tso

Date: Tue Mar 10 2026 - 00:52:48 EST


> >The fact that every version of chardet was surely in its training data
> >is not deemed to be relevant.
>
> That's a question for the lawyers and the courts, really. But it is
> most definitely *not* clean room. That being said, clean room is
> certainly not the only way to rewrite software that can pass legal
> muster, but it is the gold standard

Well, given that researchers were able to elicit 96% of Harry Potter
and the Sorcerer's Stone from Claude 3.7 Sonnet[1], here is the
question I have: suppose one LLM instance creates a specification by
examining the code you are trying to clone, and then a second LLM
instance --- one that was itself trained on that same code --- is fed
the specification. Regardless of whether this can be considered
"clean room" from a process perspective, there is a separate question
of whether there is enough similarity in the actual *results* that it
could also be a problem.

[1] https://arxiv.org/html/2601.02671v1

Of course, we could imagine using the LLM to incrementally rewrite the
C code that was elicited from the specification if the results are too
close to the source program --- that is, "Hey ChatGPT, please file
off the serial number so the source code looks nothing like the GPL
code that I'm trying to rip off."

The thing is, though, this is something that humans could do as well.
It wouldn't surprise me if there are cases of "clean room
implementation" where there was some incremental rewriting, and
proving that it wasn't a strict clean room procedure might be quite
difficult. It's just that with AI, it might be easier to do things at
scale.

- Ted