Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

From: Li Chen

Date: Mon Jun 01 2026 - 11:15:22 EST

Hi Mateusz,

---- On Thu, 28 May 2026 20:55:32 +0800 Mateusz Guzik <mjguzik@xxxxxxxxx> wrote ---
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > This RFC adds spawn_template, a userspace-controlled exec acceleration
> > mechanism for runtimes that repeatedly start the same executable with
> > different argv, envp, and per-spawn file descriptor setup.
> >
> > The main target is agent runtimes. Modern coding agents repeatedly start
> > short-lived helper tools such as rg, git, sed, awk, python, node, and
> > shell wrappers while they inspect and edit a workspace. Those runtimes
> > already know which tools are hot, and they are also the right place to
> > decide policy. The kernel does not choose names such as rg, git, or sed.
> > Userspace opts in by creating a template fd for one executable, then uses
> > that fd for later spawns. Launchers, shells, and build systems have a
> > similar repeated-startup shape and could use the same primitive, but the
> > agent runtime case is the main motivation for this RFC.
> >
> [..]
> > A typical agent runtime would keep one template per hot executable and
> > still build argv, envp, cwd, and pipe wiring for each tool call:
> >
> >     rg_tmpl = spawn_template_create("/usr/bin/rg");
> >
> >     for each search request:
> >         out_r, out_w = pipe_cloexec();
> >         err_r, err_w = pipe_cloexec();
> >         actions = [
> >             FCHDIR(worktree_fd),
> >             DUP2(out_w, STDOUT_FILENO),
> >             DUP2(err_w, STDERR_FILENO),
> >         ];
> >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
> >         close(out_w);
> >         close(err_w);
> >         read out_r and err_r;
> >         waitid(P_PIDFD, child.pidfd, ...);
> >
> >
> [..]
> > The cached state is intentionally small. The template fd keeps the opened
> > main executable file, an optional absolute path string, the creator
> > credential pointer, and the deny-write state. The executable identity key
> > records device, inode, size, mode, owner, ctime, and mtime, and is
> > rechecked before cached metadata is used. The ELF cache keeps only the
> > main executable's ELF header, program header table, and program header
> > count.
> >
> >   cached in this RFC          not cached in this RFC
> >   ------------------          ----------------------
> >   opened main executable      PT_INTERP metadata
> >   executable identity key     shared-library graph
> >   main ELF header             VMA layout metadata
> >   main ELF program headers    cross-process metadata sharing
> >   creator cred pointer
> >   deny-write state
> >
> > This RFC does not cache ELF interpreter metadata, shared-library
> > dependency state, or derived mapping-layout state. Shared-library
> > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> > state. It also does not share cached executable metadata between template
> > fds created by different processes. Each template owns its small cached
> > metadata object in this RFC.
> >
> > Performance
> > ===========
> >
> [..]
> >   Workload   Calls  subprocess  spawn_template  time_s       Delta
> >   (workers)  calls  calls/s     calls/s         seconds
> >   1x16       6144   411.04      420.32          14.95/14.62  +2.26%
> >   2x8        6144   666.78      690.08          9.21/8.90    +3.49%
> >   4x4        6144   955.61      1003.25         6.43/6.12    +4.99%
> >   8x2        6144   1048.25     1069.18         5.86/5.75    +2.00%
> >
>
> This problem is dear to my heart and I have been pondering it on and off
> for some time now. The entire fork + exec idiom is terrible and needs to
> be retired.
>
> Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
> some time ago and it generated remarkably similar code. But that's a
> tangent.

Partly, yes. The original idea came from using agents myself and noticing
that they spend a lot of time starting short-lived tools such as rg, sed,
git, bash, and python. I was wondering whether repeated tool calls could
be made cheaper.

After that I used an LLM to iterate on the smallest kernel prototype for
the idea. I then did some review, patch splitting, testing, benchmarking,
and leak checking, and threw away some caching code that turned out not
to be useful.

> I'm rather confused by the angle in the patchset. Most of this shaves
> off a tiny amount of work, while retaining the primary avoidable reason
> for bad performance: the very fact that fork is part of the picture,
> especially the part mucking with mm. Creating a pristine process is the
> way to go.
>
> Additionally there is a known problem where transiently copied file
> descriptors on fork + exec cause a headache in multithreaded programs
> doing something like this in parallel. I only did cursory reading, it
> seems your patchset keeps the same problem in place.
>
> There are numerous impactful ways to speed up execs both in terms of
> single-threaded cost and their multicore scalability, most of which
> would be immediately usable by all programs without an opt-in. imo these
> needs to be exhausted before something like a "template" can be
> considered.
>
> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As in, whatever the interface, it needs to create an "empty" target
> process (for lack of a better term).
>
> In terms of userspace-visible APIs, a clean solution escapes me.
>
> Some time ago I proposed returning a handle which is populated over time
> by the parent-to-be. One of the problems with it I failed to consider at
> the time is NUMA locality -- what if the process to be created is going
> to run on another domain? For example, opening and installing a file for
> its later use will result in avoidable loss of locality for some of the
> in-kernel data. That's on top of the fd vs fork problem.
>
> From perf standpoint, the final goal of whatever mechanism should be a
> state where the target process avoided copying any state it did not need
> to and which allocated any memory it needed from local NUMA node
> (whatever it may happen to be). Of course if no affinity is assigned it
> may happen to move again and lose such locality, nothing can be done
> about that. But pretend the process is to run in a specific node the
> parent is NOT running in.
>
> So I think the pragmatic way forward is to implement something close to
> posix_spawn in the kernel. It may make sense for the thing to take the
> PATH argument for repeated exec attempts. I understand this is of no use
> in your particular case, but it very much IS of use for most of the
> real-world. The initial implementation might even start with doing vfork
> just to get it off the ground.
>
> The next step would be to extend the interface with means to AVOID
> copying any file descriptors. There could be a dedicated file action
> which tells the kernel to avoid such copies or something like a
> close_range file action (or close_from) -- with a range like <0, INT_MAX>
> you know no fds are copied.
>
> For the NUMA angle to be sorted out, any file action which opens a file
> or dups from the parent needs to execute in the child. And frankly
> something would be needed to ask the scheduler where does it think the
> child is going to run, so that the task_struct itself can also be
> allocated with the right backing.
>
> I have not looked into what's needed to create a new process and NOT
> mess with mm, but I don't think there are unsolvable problems there, at
> worst some churn.
>
> There are of course other parameters which need to be sorted out, that's
> covered by the posix_spawn thing.
>
> This e-mail is long enough, so I'm not going to go into issues
> concerning exec itself right now.
>
> tl;dr I would suggest redoing the patchset as posix_spawn and then doing
> the actual optimization of not cloning mm itself.
>

Thanks a lot for writing this up. I clearly had too narrow a view of the
problem. I was mostly thinking about repeated executable startup, but your
reply, together with Christian's and Andy's, made me see that the more
useful target is probably a pidfd/pidfs-backed process builder which can
sit under posix_spawn and then grow into something that avoids the
fork-shaped mm and fd costs. I learned a lot from this thread.

At a high level, Windows CreateProcess/NtCreateUserProcess also looks
closer to this direction than fork+exec: create the target process
directly, pass explicit startup attributes and handle inheritance state,
and avoid starting from a copy of the parent address space. That seems
to be the same basic advantage here: build the child closer to its final
shape instead of copying parent state and then throwing much of it away.

I will study the process creation, exec, pidfd/pidfs, and posix_spawn
code more carefully, then try the direction you suggested and benchmark
the mm/fd costs.
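
As a reference point for that comparison, the per-call pipe wiring from
the cover letter can already be written with plain posix_spawn file
actions in userspace. The following is only a minimal sketch, assuming
glibc on Linux; spawn_capture() is a hypothetical helper name and is not
part of the RFC or of any proposed kernel interface. It captures stdout
only and leaves out the FCHDIR and stderr actions for brevity.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (not part of the RFC): run argv[0] with stdout
 * redirected into a pipe, copy up to cap - 1 bytes of its output into
 * buf as a NUL-terminated string, and return the child's exit status,
 * or -1 on error or abnormal termination.
 */
static int spawn_capture(char *const argv[], char *buf, size_t cap)
{
	posix_spawn_file_actions_t fa;
	int out[2], status, rc;
	size_t len = 0;
	ssize_t n;
	pid_t pid;

	if (cap == 0 || pipe(out) < 0)
		return -1;
	/* Mark both ends close-on-exec so the child inherits neither. */
	fcntl(out[0], F_SETFD, FD_CLOEXEC);
	fcntl(out[1], F_SETFD, FD_CLOEXEC);

	if (posix_spawn_file_actions_init(&fa) != 0) {
		close(out[0]);
		close(out[1]);
		return -1;
	}
	/* The dup2() copy installed as stdout does not carry FD_CLOEXEC. */
	rc = posix_spawn_file_actions_adddup2(&fa, out[1], STDOUT_FILENO);
	if (rc == 0)
		rc = posix_spawn(&pid, argv[0], &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	close(out[1]);		/* parent keeps only the read end */
	if (rc != 0) {
		close(out[0]);
		return -1;
	}

	/* Read until EOF, error, or a full buffer. */
	while (len < cap - 1 &&
	       (n = read(out[0], buf + len, cap - 1 - len)) > 0)
		len += (size_t)n;
	buf[len] = '\0';
	close(out[0]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass something like { "/bin/echo", "hello", NULL } as argv
together with a small buffer. glibc currently implements posix_spawn on
top of a vfork-style clone, so this path still starts from the parent's
mm and fd table; that is the cost a kernel-side process builder would
aim to avoid.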

Regards,
Li