Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
From: Sebastian Andrzej Siewior
Date: Thu Mar 26 2026 - 03:41:23 EST
On 2026-03-25 14:02:05 [-1000], Tejun Heo wrote:
> > The only issue I see is if I delay the irq_work callback by a second.
> > Other than that, I don't see any problems.
>
> What issue do you see when delaying it by a second? Just things being slowed
> down?
This is during boot:
[ OK ] Mounted sys-kernel-debug.mount - Kernel Debug File System.
[ OK ] Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[ OK ] Mounted tmp.mount - Temporary Directory /tmp.
[ 20.845878] INFO: task systemd:1 blocked for more than 10 seconds.
[ 20.845885] Not tainted 7.0.0-rc5+ #178
[ 20.845887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 20.845888] task:systemd state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00080000
[ 20.845906] Call Trace:
[ 20.845911] <TASK>
[ 20.845915] __schedule+0x3db/0xf90
[ 20.845947] schedule+0x27/0xd0
[ 20.845950] cgroup_drain_dying+0x9b/0x190
[ 20.845971] cgroup_rmdir+0x2d/0x100
[ 20.845980] kernfs_iop_rmdir+0x6a/0xd0
[ 20.845993] vfs_rmdir+0x11a/0x280
[ 20.846002] filename_rmdir+0x16f/0x1e0
[ 20.846009] __x64_sys_rmdir+0x28/0x40
[ 20.846015] do_syscall_64+0x119/0x5a0
[ 20.846152] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 20.846158] RIP: 0033:0x7ff495627337
[ 20.846164] RSP: 002b:00007ffd7efa66f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[ 20.846170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff495627337
[ 20.846172] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00005646ca9583a0
[ 20.846173] RBP: 00005646ca9583a0 R08: 000000000000000c R09: 0000000000000000
[ 20.846174] R10: 0000000000000000 R11: 0000000000000246 R12: 00005646ca957ac0
[ 20.846175] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000000
[ 20.846178] </TASK>
It does not recover, so I suspect there is another race lurking. This is the
change I am talking about:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7112,9 +7112,9 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
* irq_work to allow batching while ensuring timely completion.
*/
static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
-static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+static DEFINE_PER_CPU(struct delayed_work, cgrp_delayed_tasks_iwork);
-static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+static void cgrp_dead_tasks_iwork_fn(struct work_struct *iwork)
{
struct llist_node *lnode;
struct task_struct *task, *next;
@@ -7131,9 +7131,11 @@ static void __init cgroup_rt_init(void)
int cpu;
for_each_possible_cpu(cpu) {
+ struct delayed_work *dwork;
+
init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
- per_cpu(cgrp_dead_tasks_iwork, cpu) =
- IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+ dwork = &per_cpu(cgrp_delayed_tasks_iwork, cpu);
+ INIT_DELAYED_WORK(dwork, cgrp_dead_tasks_iwork_fn);
}
}
@@ -7141,7 +7143,7 @@ void cgroup_task_dead(struct task_struct *task)
{
get_task_struct(task);
llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
- irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+ schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
}
#else /* CONFIG_PREEMPT_RT */
static void __init cgroup_rt_init(void) {}
> Thanks.
>
Sebastian