Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
From: Sebastian Andrzej Siewior
Date: Mon Mar 23 2026 - 07:33:05 EST
On 2026-03-22 17:58:06 [-1000], Tejun Heo wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6224,6 +6225,63 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
…
> +static int cgroup_drain_dying(struct cgroup *cgrp)
> + __releases(&cgroup_mutex) __acquires(&cgroup_mutex)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + DEFINE_WAIT(wait);
> +
> + lockdep_assert_held(&cgroup_mutex);
> +retry:
> + if (!cgroup_is_populated(cgrp))
> + return 0;
> +
> + /* Same iterator as cgroup.threads - if any task is visible, it's busy */
> + css_task_iter_start(&cgrp->self, 0, &it);
> + task = css_task_iter_next(&it);
> + css_task_iter_end(&it);
> +
> + if (task)
> + return -EBUSY;
> +
> + /*
> + * All remaining tasks are PF_EXITING and will pass through
> + * cgroup_task_dead() shortly. Wait for a kick and retry.
> + */
> + prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
> + TASK_UNINTERRUPTIBLE);
> + mutex_unlock(&cgroup_mutex);
I had to add here:

	if (cgroup_is_populated(cgrp))
> + schedule();
I saw instances on PREEMPT_RT where the cgroup_is_populated() check above
reported true because cgrp->nr_populated_csets was still 1 and the
following iterator returned NULL, but in that window do_cgroup_task_dead()
saw no waiter and continued without a wake_up(), so the subsequent
schedule() hung.
There is no serialisation between this check/wait and the later wake. An
alternative would be to do the check and prepare_to_wait() under
css_set_lock.
> + finish_wait(&cgrp->dying_populated_waitq, &wait);
> + mutex_lock(&cgroup_mutex);
> + goto retry;
> +}
Then I added my RCU patch. This led to a problem already during boot up
(I didn't manage to get to the test suite).
systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
modprobe-1044 is still there in state Z, so the cgroup removal didn't get
there yet. The irq_work was quicker than RCU in this case. This can be
reproduced without RCU by
- irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+ schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
so there is always a one second delay. If I give up waiting after 10
seconds, the system eventually boots and there are no zombies around.
test_core seems to complete…
Keeping the irq_work as-is, cgroup_dead() happens on the HZ tick.
test_core then complains only with
| not ok 7 test_cgcore_populated
and everything else passes. With schedule_work() (i.e. right away) all
tests pass, including test_stress.sh.
Is there another race lurking?
Sebastian