Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups

From: Martin KaFai Lau

Date: Thu Oct 30 2025 - 18:42:30 EST




On 10/30/25 2:34 PM, Song Liu wrote:
Hi Roman,

On Thu, Oct 30, 2025 at 12:07 PM Roman Gushchin
<roman.gushchin@xxxxxxxxx> wrote:
[...]
In TCP congestion control and BPF qdisc's model:

During link_create, both adds the struct_ops to a list, and the
struct_ops can be indexed by name. The struct_ops are not "active" by
this time.
Then, each has their own interface to 'apply' the struct_ops to a
socket or queue: setsockopt() or netlink.

But maybe cgroup-related struct_ops are different.

Both tcp congestion and qdisk cases are somewhat different because
there already is a way to select between multiple implementations, bpf
just adds another one. In the oom case, it's not true. As of today,
there is only one (global) oom killer. Of course we can create
interfaces to allow a user make a choice. But the question is do we want
to create such interface for the oom case specifically (and later for
each new case separately), or there is a place for some generalization?

Agreed that this approach requires a separate mechanism to attach
the struct_ops to an entity.

Ok, let me summarize the options we discussed here:

Thanks for the summary!


1) Make the attachment details (e.g. cgroup_id) the part of struct ops
itself. The attachment is happening at the reg() time.

+: It's convenient for complex stateful struct ops'es, because a
single entity represents a combination of code and data.
-: No way to attach a single struct ops to multiple entities.

This approach is used by Tejun for per-cgroup sched_ext prototype.

2) Make the attachment details a part of bpf_link creation. The
attachment is still happening at the reg() time.

+: A single struct ops can be attached to multiple entities.
-: Implementing stateful struct ops'es is harder and requires passing
an additional argument (some sort of "self") to all callbacks.
I'm using this approach in the bpf oom proposal.


I think both 1) and 2) have the following issue. With cgroup_id in
struct_ops or the link, the cgroup_id works more like a filter. The
cgroup doesn't hold any reference to the struct_ops. The bpf link
holds the reference to the struct_ops, so we need to keep the
the link alive, either by keeping an active fd, or by pinning the
link to bpffs. When the cgroup is removed, we need to clean up
the bpf link separately.

The link can be detached (struct_ops's unreg) by the user space.

The link can also be detached from the subsystem (cgroup) here.
It was requested by scx:
https://lore.kernel.org/all/20240530065946.979330-7-thinker.li@xxxxxxxxx/

Not sure if scx has started using it.


3) Move the attachment out of .reg() scope entirely. reg() will register
the implementation system-wide and then some 3rd-party interface
(e.g. cgroupfs) should be used to select the implementation.

+: ?
-: New hard-coded interfaces might be required to enable bpf-driven
kernel customization. The "attachment" code is not shared between
various struct ops cases.
Implementing stateful struct ops'es is harder and requires passing
an additional argument (some sort of "self") to all callbacks.

This approach works well for cases when there is already a selection
of implementations (e.g. tcp congestion mechanisms), and bpf is adding
another one.

Another benefit of 3) is that it allows loading an OOM controller in a
kernel module, just like loading a file system in a kernel module. This
is possible with 3) because we paid the cost of adding a new select
attach interface.

A semi-separate topic, option 2) enables attaching a BPF program
to a kernel object (a cgroup here, but could be something else). This
is an interesting idea, and we may find it useful in other cases (attach
a BPF program to a task_struct, etc.).

Does it have plan for a pure kernel module oom implementation?
I think the link-to-cgrp support here does not necessary stop the
later write to cgroupfs support if a kernel module oom is indeed needed
in the future.

imo, cgroup-bpf has a eco-system around it, so it is sort of special. bpf user
has expectation on how a bpf prog is attached to a cgroup. The introspection,
auto detachment from the cgroup when the link is gone...etc.

If link-to-cgrp is used, I prefer (2). Stay with one way to attach
to a cgrp. It is also consistent with the current way of attaching a single
bpf prog to a cgroup. It is now attaching a map/set of bpf prog to a cgroup.
The individual struct_ops implementation can decide if it should
allow a struct_ops be attached multiple times.