Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT

From: Sebastian Andrzej Siewior

Date: Thu Nov 06 2025 - 10:07:21 EST


On 2025-11-05 09:03:55 [-1000], Tejun Heo wrote:
> +#ifdef CONFIG_PREEMPT_RT
> +/*
> + * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
> + * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
> + * this lead to sleeping in the invalid context warning bug. css_set_lock is too
> + * big to become a raw_spinlock. The task_dead path doesn't need to run
> + * synchronously. Bounce through irq_work instead.
> + */
> +static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
> +static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
> +
> +static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
> +{
> + struct llist_node *lnode;
> + struct task_struct *task, *next;
> +
> + lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
> + llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
> + do_cgroup_task_dead(task);
> + put_task_struct(task);
> + }
> +}
> +
> +static void __init cgroup_rt_init(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
> + init_irq_work(per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu),
> + cgrp_dead_tasks_iwork_fn);

How important is it that this happens right away? Written as-is, this
raises an interrupt which wakes the irq_work/$cpu thread, which then runs
this callback. That thread runs as SCHED_FIFO-1. This means the
termination of SCHED_OTHER tasks on a single CPU will run as follows:
- TASK_DEAD
  schedule()
  - queue IRQ_WORK
  -> INTERRUPT
  -> WAKE irq_work
  -> preempt to irq_work/
  -> handle one callback
  schedule()
  back to next TASK_DEAD

So cgrp_dead_tasks_iwork_fn() will never have the opportunity to batch,
unless the exiting task's priority is > 1. In that case the callback will
be delayed until all RT tasks are done.

My proposal would be to init the irq_work item with

	*per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu) = IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);

instead, which won't raise an IRQ immediately and delays the callback
until the next timer tick, so it could batch multiple tasks.

[ queue_work() should work too, but the overhead to schedule is greater
imho, so this makes sense ]

Sebastian