Re: [PATCH 1/4] ns: add bpf hooks
From: Christian Brauner
Date: Fri Feb 27 2026 - 06:04:46 EST
On Mon, Feb 23, 2026 at 01:44:23PM +0100, Djalal Harouni wrote:
> On 2/20/26 01:38, Christian Brauner wrote:
> > Add the three namespace lifecycle hooks and make them available to bpf
> > lsm program types. This allows bpf to supervise namespace creation. I'm
> > in the process of adding various "universal truth" bpf programs to
> > systemd that will make use of this. This e.g., allows to lock in a
> > program into a given set of namespaces.
>
> Thank you Christian, so if this feature is added we will also
> use it.
>
> The commit log says lock in a given set of namespaces where I see
> only setns path am I right? would it make sense to also have the
Yes.
> check around some callers of create_new_namespaces() where
> appropriate before nsproxy switch if we don't want to go deep, but
> allow a bit of control or easy checks around
> CLONE_NEWNS/mount/pivot_root fs combinations?
Yes, I have planned that but we will massage that codepath quite a bit
this cycle to deal with some races so I'd rather push this out for this
reason and also...
... I need to think about how exactly we should hook into that. Probably
when we already have assembled the new namespace set but then I want to
pass it to the hook in a way that I can guarantee KF_TRUSTED_ARGS so
callers can use the macros I have to cast from struct ns_common to
actual namespace type.
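To illustrate the kind of cast meant here, a minimal user-space sketch
follows. It is not the kernel's code: every demo_* name and both struct
layouts are hypothetical stand-ins, and the real structs carry many more
fields. It only shows the container_of pattern of getting from an embedded
common struct back to the enclosing namespace type, and why the pointer
handed to the hook has to be trustworthy for that to be safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for illustration only. */
enum demo_ns_type { DEMO_NS_MNT, DEMO_NS_USER };

struct demo_ns_common {
	enum demo_ns_type type;
	unsigned int inum;
};

struct demo_mnt_ns {
	unsigned long nr_mounts;
	struct demo_ns_common ns;	/* embedded, deliberately not at offset 0 */
};

/* Step back from a member pointer to the struct that contains it. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Cast a common pointer to the mount namespace that embeds it.
 * Returns NULL when the type tag says this is some other namespace.
 * The pointer arithmetic is only valid if ns really points into a
 * struct demo_mnt_ns, hence the need for a trusted argument.
 */
static struct demo_mnt_ns *demo_to_mnt_ns(struct demo_ns_common *ns)
{
	assert(ns != NULL);
	if (ns->type != DEMO_NS_MNT)
		return NULL;
	return demo_container_of(ns, struct demo_mnt_ns, ns);
}
```

The type check in the sketch is an assumption about how such a helper
might guard against a mismatched namespace type, not a description of the
macros in the series.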
We will need additional per-ns type hooks in the future as well. For
example, one would very likely want to supervise writes of idmappings to
a userns, so we need to add hooks for that into /proc/<pid>/{g,u}id_map
as well... and setgroups, now that I come to think of it.
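For context, the userspace side that such hooks would end up supervising
looks roughly like the sketch below. The demo_* helpers are hypothetical
names introduced for illustration. The "inside outside count" line format
and the "deny" value for setgroups follow user_namespaces(7). Nothing here
reflects how the future hooks themselves would be implemented.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build "/proc/<pid>/<file>" where file is "uid_map", "gid_map" or
 * "setgroups". Returns 0 on success, -1 if the buffer is too small.
 */
static int demo_proc_path(char *buf, size_t len, pid_t pid, const char *file)
{
	int n = snprintf(buf, len, "/proc/%ld/%s", (long)pid, file);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format one idmapping extent: "<inside> <outside> <count>\n". */
static int demo_idmap_line(char *buf, size_t len, unsigned int inside,
			   unsigned int outside, unsigned int count)
{
	int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write data to path with a single write(2) call, since the map files
 * expect the whole mapping in one write. Returns 0 on success, -1 on
 * any error or short write.
 */
static int demo_write_file(const char *path, const char *data)
{
	size_t len = strlen(data);
	ssize_t ret;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, data, len);
	close(fd);
	return (ret == (ssize_t)len) ? 0 : -1;
}
```

An unprivileged caller would typically write "deny" to setgroups first,
then gid_map, then uid_map, which is why all three files would likely
need to be covered.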
And fwiw, I'm replacing pivot_root() this cycle and I expect userspace to
fade it out eventually. It's an insane system call that holds tasklist
lock to walk _all tasks_ on the system each time you switch the
container's rootfs just to mess with the pwd and root. That creates all
kinds of races and no container setup actually needs to do the pwd/root
replacement.
So it's really unneeded unless you do weird stuff like switching out the
rootfs in init_mnt_ns post early boot. Which is insane and can't work
for a lot of other reasons and the pwd/root rewrite doesn't solve
pinning via fds anyway so really that all needs to be Michael Myers'ed.
Next release MOVE_MOUNT_BENEATH will take over that job by making it
work with locked mounts and the rootfs.