Re: [PATCH v3 2/2] io_uring/io-wq: inherit cpuset of cgroup in io worker

From: MOESSBAUER, Felix
Date: Wed Sep 11 2024 - 03:03:19 EST


On Tue, 2024-09-10 at 13:42 -0400, Waiman Long wrote:
>
> On 9/10/24 13:11, Felix Moessbauer wrote:
> > The io worker threads are userland threads that just never exit to the
> > userland. By that, they are also assigned to a cgroup (the group of the
> > creating task).
>
> The io-wq task is not actually assigned to a cgroup. To belong to a
> cgroup, its pid has to be present in the cgroup.procs of the
> corresponding cgroup, which is not the case here.

Hi, thanks for jumping in. As said, I'm not too familiar with the
internals of the io worker threads. Nonetheless, the kernel presents
the cgroup assignment quite consistently. This, however, contradicts your
statement above. Example:

pid tid
648460 648460 SCHED_OTHER 20 S 0 0-1 ./test/wq-aff.t
648460 648461 SCHED_OTHER 20 S 1 1 iou-sqp-648460
648460 648462 SCHED_OTHER 20 S 0 0 iou-wrk-648461

When I now check cgroup.procs, I just see 648460, which is expected as
this is the process (with its main thread). Checking cgroup.threads
shows all three tids.

When checking the other way round, I get the same information:
$ cat /proc/648460/task/648461/cgroup
0::/user.slice/user-1000.slice/session-1.scope
$ cat /proc/648460/task/648462/cgroup
0::/user.slice/user-1000.slice/session-1.scope

Now I'm wondering if it is just presented incorrectly, or if these
tasks indeed belong to the mentioned cgroup?
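
For anyone who wants to repeat the per-thread check from above, here is a
minimal Python sketch. It only reads /proc/<pid>/task/<tid>/cgroup, whose
lines have the form hierarchy-ID:controller-list:path. The function names
are made up for illustration and are not part of the patch or of liburing.

```python
import os


def parse_cgroup_line(line):
    """Split one line of a /proc/.../cgroup file into its three fields.

    Returns (hierarchy_id, controllers, path). On cgroup v2 the
    controller list is empty and the hierarchy id is 0.
    """
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return int(hierarchy_id), [c for c in controllers.split(",") if c], path


def thread_cgroups(pid):
    """Map every tid of `pid` to the parsed lines of its cgroup file.

    Hypothetical helper: assumes a Linux /proc and that the threads
    do not exit while the directory is being walked.
    """
    result = {}
    for tid in sorted(int(t) for t in os.listdir(f"/proc/{pid}/task")):
        with open(f"/proc/{pid}/task/{tid}/cgroup") as f:
            result[tid] = [parse_cgroup_line(line) for line in f if line.strip()]
    return result
```

Calling thread_cgroups(648460) while the test is running should list one
entry per tid, including the iou-sqp and iou-wrk threads if the kernel
reports a cgroup for them.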

> My understanding is that you are just restricting the CPU affinity to
> follow the cpuset of the corresponding user task that creates it. The CPU
> affinity (cpumask) is just one of the many resources controlled by a
> cgroup. That probably needs to be clarified.

That's clear. Looking at the bigger picture, I want to ensure that the
io workers do not break out of the cgroup limits (I called it "ambient"
before, similar to the capabilities), because this breaks the isolation
assumption. In our case, we are mostly interested in not leaving the
cpuset, as we use that to perform system partitioning into realtime and
non-realtime parts.
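
One way to spot a thread that has left the cpuset from userspace is to
compare its affinity mask with the CPUs the cgroup allows. The sketch below
is only an illustration: the helper names are hypothetical, and reading the
allowed CPUs from cpuset.cpus.effective assumes cgroup v2 with the cpuset
controller enabled for that cgroup.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list string such as "0-1,4" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means no CPUs
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cpus_outside_cpuset(tid, allowed):
    """Return the CPUs in the affinity mask of `tid` that `allowed` lacks.

    An empty result means the thread stays inside the cpuset.
    os.sched_getaffinity() is Linux-only and also accepts a thread id.
    """
    return os.sched_getaffinity(tid) - allowed


# Assumed usage: `allowed` would come from a file such as
#   /sys/fs/cgroup/<cgroup path>/cpuset.cpus.effective
# e.g. allowed = parse_cpu_list(open(path).read())
```

Running cpus_outside_cpuset() over all tids of the test process, including
the io workers, gives a quick check of whether any of them can run outside
the partition.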

>
> Besides cpumask, the cpuset controller also controls the node mask of
> the memory nodes allowed.

Yes, and that is especially important as some memory can be "closer" to
the IOs than others.

Best regards,
Felix

--
Siemens AG, Technology
Linux Expert Center