Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir

From: Sebastian Andrzej Siewior

Date: Tue Mar 24 2026 - 04:38:57 EST


On 2026-03-23 09:55:40 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> > Then I added my RCU patch. This led to a problem already during boot up
> > (didn't manage to get to the test suite).
>
> Is that the patch to move cgroup_task_dead() to delayed_put_task_struct()? I
> don't think we can delay populated state update till usage count reaches
> zero. e.g. bpf_task_acquire() can be used by arbitrary bpf programs and will
> pin the usage count indefinitely delaying populated state update. Similar to
> delaying the event to free path, you can construct a deadlock scenario too.

Okay, then. I expected it to be a limited window within a bpf program or
sched_ext.

> > systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
> > It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
> > modprobe-1044 is still there in Z so the cgroup removal didn't get there
> > yet. That irq_work was quicker than RCU in this case. This can be
> > reproduced without RCU by
>
> Isn't this the exact scenario? systemd is the one who should reap and drop
> the usage count but it's waiting for rmdir() to finish which can't finish
> due to the usage count which hasn't been reaped by systemd? We can't
> interlock these two. They have to make progress independently.

But nobody is holding it back. For some reason systemd-1 did not reap
modprobe-1044 first but went for the rmdir() first. I noticed it first
with RCU, but it was also there after delaying the cleanup by one second
without RCU.

> > - irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> > + schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
> >
> > So there is always a one second delay. If I give up waiting after 10secs
> > then it boots eventually and there are no zombies around. The test_core
> > seems to complete…
> >
> > Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
> > tick. test_core then complains just with
> > | not ok 7 test_cgcore_populated
>
> The test is assuming that waitpid() success guarantees cgroup !populated
> event. While before all these changes, that held, it wasn't intentional and
> the test just picked up on arbitrary ordering. I'll just remove that
> particular test.

Okay. Thanks.

> Thanks.

Sebastian