Re: [PATCH 2/4] cgroup: add bpf hook for attach
From: Christian Brauner
Date: Fri Feb 27 2026 - 08:44:37 EST
On Mon, Feb 23, 2026 at 04:47:11PM +0100, Michal Koutný wrote:
> Hi.
>
> On Fri, Feb 20, 2026 at 01:38:30AM +0100, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > Add a hook to manage attaching tasks to cgroup. I'm in the process of
> > adding various "universal truth" bpf programs to systemd that will make
> > use of this.
> >
> > This has been a long-standing request (cf. [1] and [2]). It will allow us to
> > enforce cgroup migrations and ensure that services can never escape their
> > cgroups. This is just one of many use-cases.
> >
> > Link: https://github.com/systemd/systemd/issues/6356 [1]
> > Link: https://github.com/systemd/systemd/issues/22874 [2]
>
> These two issues are misconfigured/misunderstood PAM configs. I don't
> think those warrant introduction of another permissions mechanism,
> furthermore they're relatively old and I estimate many of such configs
> must have been fixed in the course of time.
logind has to allow cgroup migrations but for, say, Docker this shouldn't
be allowed. So calling this a misconfiguration is like taking a shortcut
by simply pointing to a different destination. But fine, let's say you
insist on this not being valid.
> As for services escaping their cgroups -- they needn't run as root, do
> they? And if you seek a mechanism how to prevent even root from
> migrations, there are cgroupnses for that. (BTW what would prevent a
A bunch of tools that do cgroup migrations don't use cgroup namespaces
and there's no requirement or way to enforce that they do. Plus, there's
no requirement to only do cgroup management via systemd or its APIs.
Frankly, I can't even blame userspace for not having widely adopted
cgroup namespaces. The implementation on top of a single superblock like
cgroupfs is questionable.
But in general the point is that there's no mechanism to enforce cgroup
tree policy currently in a sufficiently flexible manner.
> root detaching/disabling these hook progs anyway?)
I cannot help but read this as you asking me "What if you're too dumb to
write a security policy that isn't self-defeating?" :)
bpf has security hooks for itself including security_bpf(). First thing
that comes to mind is to have security.bpf.* or trusted.* xattrs on
selected processes like PID 1 that mark them as eligible for modifying BPF
state, or BPF LSM programs supervising link/prog detach, update, etc., and
then designating only PID 1 as handing out those magical xattrs. Can be
as fine-grained as needed and that tells everyone else to go away and do
something else.
There are more fine-grained mechanisms to deal with this. IOW, it's a
solvable problem.
> I think that the cgroup file permissions are sufficient for many use
> cases and this BPF hook is too tempting in unnecessary cases (like
> masking other issues).
> Could you please expand more about some other reasonable use cases not
> covered by those?
systemd will gain the ability to implement policy to control cgroup tree
modifications in as much detail as it needs without the kernel needing
to be aware of it. This can take various forms, from marking only
select processes as being eligible for managing cgroup migrations to
just locking down specific cgroups.
The policy needs to be flexible so it can be live-updated, switched into
auditing mode, and loosened or tightened on demand as needed.
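Purely as an illustration of what "flexible" means here, a userspace
sketch of such a decision function. This is not code from this series
and not a BPF program; every name in it (attach_policy, policy_mode,
attach_allowed) is hypothetical. In a real deployment the mode and the
manager list would presumably live in state that the policy owner can
update at runtime.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy modes; not taken from the patch series. */
enum policy_mode {
	POLICY_DISABLED,	/* allow everything, record nothing */
	POLICY_AUDIT,		/* record violations but still allow */
	POLICY_ENFORCE,		/* deny violations */
};

struct attach_policy {
	enum policy_mode mode;
	const int *managers;	/* pids eligible to migrate tasks */
	size_t nr_managers;
	unsigned long audited;	/* violations seen in audit mode */
};

static bool is_manager(const struct attach_policy *p, int pid)
{
	for (size_t i = 0; i < p->nr_managers; i++)
		if (p->managers[i] == pid)
			return true;
	return false;
}

/* Returns 0 to allow the migration, -1 to deny it. */
static int attach_allowed(struct attach_policy *p, int migrator_pid)
{
	if (p->mode == POLICY_DISABLED || is_manager(p, migrator_pid))
		return 0;
	if (p->mode == POLICY_AUDIT) {
		p->audited++;	/* would-be denial, only counted */
		return 0;
	}
	return -1;
}
```

Switching between auditing and enforcement is then a single write to
mode, and tightening or loosening is a change to the manager list,
without the kernel having to know what the policy means.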
> (BTW I notice there's already a very similar BPF hook in sched_ext's
> cgroup_prep_move. It'd be nicer to have only one generic approach to
> these checks.)
This feels a bit like a wild goose chase. But fine, I'll look at it.
/me goes off
Ok, let's start with cgroup_can_fork(). The sched_ext hook isn't a
generic permission check. It's called way after
cgroup_attach_permissions() and is a per cgroup controller check that is
only called for some cgroup controllers. So it's effectively useless to
pull up. (Notice also how some controllers like cpuset call additional
security hooks already.)
The same problem applies to writes to cgroup.procs and to subtree
control. The sched_ext hooks are per cgroup controller, not generically
called.
And they happen to be called in cgroup_migrate_execute() which is way
deep in the callchain. When cgroup_attach_permissions() fails it's
effectively free. If cgroup_migrate_execute() fails it must put/free css
sets, it must splice back the tasks on mg_tasks, it must call
cancel_attach() callbacks, thus it must call the sched_ext cancel
callbacks for each already prepped task, it must uncharge pids for each
already prepped task, and it needs to unlock a bunch of stuff.
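To make the cost difference concrete, a toy userspace model follows. It
is not kernel code; migration_stats, migrate_early_check() and
migrate_late_check() are made-up names and the only thing they do is
count how much per-task work has to be undone on each path.

```c
#include <stddef.h>

/* Hypothetical bookkeeping; it only counts the work each path does. */
struct migration_stats {
	size_t prepped;		/* tasks that went through preparation */
	size_t cancelled;	/* preparations that had to be undone */
};

/* Early check: refuse before any per-task state has been set up. */
static int migrate_early_check(struct migration_stats *st,
			       size_t nr_tasks, int allowed)
{
	if (!allowed)
		return -1;	/* nothing prepared, nothing to undo */
	st->prepped += nr_tasks;
	return 0;
}

/*
 * Late check: the task at index fail_at is refused during preparation,
 * so every task prepared before it has to be unwound again.
 */
static int migrate_late_check(struct migration_stats *st,
			      size_t nr_tasks, size_t fail_at)
{
	size_t done = 0;

	for (size_t i = 0; i < nr_tasks; i++) {
		if (i == fail_at) {
			while (done > 0) {	/* cancel each prepped task */
				done--;
				st->cancelled++;
			}
			return -1;
		}
		st->prepped++;
		done++;
	}
	return 0;
}
```

With eight tasks and a refusal at index 5, the early path touches
nothing, while the late path prepares five tasks and then cancels all
five.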
On top of that this looks like a category mistake imho. The callbacks
are a DAC-like permission mechanism whereas the hook is actual MAC
permission checking. I'm not sure lumping this together with
per-cgroup-controller migration preparations will be very clean. I think
it will end up looking rather confusing. But that's best left to you
cgroup maintainers, I think.