Re: ~90s reboot delay with v6.19 and PREEMPT_RT

From: Bert Karwatzki

Date: Wed Feb 25 2026 - 11:47:25 EST

On Wednesday, 25.02.2026 at 16:43 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-19 17:46:47 [+0100], Bert Karwatzki wrote:
> > Since linux v6.19 I noticed that rebooting my MSI Alpha 15 Laptop
> > would hang for about ~90s before rebooting. I bisected this (from
> > v6.18 to v6.19) and got this as the first bad commit:
> > 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
> …
>
> I'm on it. It looks like we free the task after sched_process_wait() but
> before it is entirely gone there is a wait() on its pid. Some of them do
> come back but one seems to be stuck and I need to figure out which one.
> If we get rid of the LAZY then it happens "quick" enough so it works.
>
> > Bert Karwatzki
>
> Sebastian

I've done two test runs with this debug patch (the persistent log buffer works now, thanks
again to Steven Rostedt):

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f0d33b04910..b750aa284b89 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6990,6 +6990,7 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
{
struct css_set *cset;
unsigned long flags;
+ trace_printk(KERN_INFO "%s 0: task = %px\n", __func__, tsk);

spin_lock_irqsave(&css_set_lock, flags);

@@ -7029,9 +7030,11 @@ static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
{
struct llist_node *lnode;
struct task_struct *task, *next;
+ trace_printk(KERN_INFO "%s:\n", __func__);

lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+ trace_printk(KERN_INFO "%s: %px %s", __func__, task, task->comm);
do_cgroup_task_dead(task);
put_task_struct(task);
}
@@ -7050,6 +7053,7 @@ static void __init cgroup_rt_init(void)

void cgroup_task_dead(struct task_struct *task)
{
+ trace_printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
get_task_struct(task);
llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
@@ -7059,6 +7063,7 @@ static void __init cgroup_rt_init(void) {}

void cgroup_task_dead(struct task_struct *task)
{
+ trace_printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
do_cgroup_task_dead(task);
}
#endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 854984967fe2..19b130b831bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5078,6 +5078,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
struct rq *rq = this_rq();
struct mm_struct *mm = rq->prev_mm;
unsigned int prev_state;
+ trace_printk(KERN_INFO "%s 0: %px (%s)\n", __func__, prev, prev->comm);

/*
* The previous task will have left us with a preempt_count of 2
@@ -5153,6 +5154,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
* visible to SCX schedulers.
*/
sched_ext_dead(prev);
+ trace_printk(KERN_INFO "%s 1: %px (%s)\n", __func__, prev, prev->comm);
cgroup_task_dead(prev);

/* Task is done with its stack. */
@@ -5202,6 +5204,7 @@ static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
+ trace_printk(KERN_INFO "%s 0: %px (%s)\n", __func__, prev, prev->comm);
prepare_task_switch(rq, prev, next);

/*

This is from the PREEMPT_RT log; there's a long pause in which cgroup_task_dead() is not called:
59366: <...>-3209 [001] d..2. 33.110392: 0xffffffffa36c309b: 6context_switch 0: ffff933eb264a180 (reboot)
[...]
112455: <idle>-0 [006] ...1. 40.503766: 0xffffffffa2da570c: 6cgroup_task_dead: task = ffff933f1885c300 ((udev-worker))
[...] no call to cgroup_task_dead() here, just finish_task_switch() and context_switch()
217571: <idle>-0 [010] ...1. 125.282118: 0xffffffffa2da570c: 6cgroup_task_dead: task = ffff933e94fae480 (systemd)
[...]
274103 <idle>-0 [014] d..2. 130.157472: 0xffffffffa2cef125: 6finish_task_switch 0: ffff933e815e10c0 (ksoftirqd/14)
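
To measure such a pause across the complete log rather than by eye, something like the
following could be used. This is only a rough sketch: the helper names trace_timestamp()
and longest_gap() are made up for illustration, and the parser assumes the line layout
seen in the excerpts above (a "[cpu]" field, then a flags token, then the timestamp in
seconds).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the timestamp (in seconds) out of one trace line.
 * Assumed layout, taken from the excerpts above:
 *   "<n>: <comm>-<pid> [cpu] flags seconds: ..."
 * Returns 0 on success, -1 if the line does not match.
 */
static int trace_timestamp(const char *line, double *ts)
{
	const char *p = strchr(line, ']');	/* end of the "[cpu]" field */

	if (!p)
		return -1;
	/* Skip the flags token (e.g. "d..2."), then read the seconds. */
	return sscanf(p + 1, " %*s %lf", ts) == 1 ? 0 : -1;
}

/*
 * Return the longest interval between two consecutive lines that contain
 * `event`; the endpoints of that interval are stored in *from and *to.
 * Returns 0.0 if fewer than two matching lines are found.
 */
static double longest_gap(const char *const *lines, size_t n,
			  const char *event, double *from, double *to)
{
	double prev = 0.0, cur, best = 0.0;
	int have_prev = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!strstr(lines[i], event) || trace_timestamp(lines[i], &cur))
			continue;
		if (have_prev && cur - prev > best) {
			best = cur - prev;
			*from = prev;
			*to = cur;
		}
		prev = cur;
		have_prev = 1;
	}
	return best;
}
```

Called with the event string "cgroup_task_dead: task" (which does not match the
"do_cgroup_task_dead 0:" messages), the two cgroup_task_dead() lines quoted above give
125.282118 - 40.503766 = 84.778352 s, which is roughly the delay I see on reboot.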

This is the other log (no pause here; I show just the first message after reboot is
initiated and the last message, to show the duration of the shutdown):

58029: <...>-2975 [003] d..2. 33.564291: 0xffffffff934b3e89: 6context_switch 0: ffff88e1e0302180 (reboot)
[...]
107700 <...>-1 [000] d..2. 37.352191: 0xffffffff92aee9c5: 6finish_task_switch 0: ffffffff93c12980 (swapper/0)

The complete logs are here:
https://gitlab.freedesktop.org/spasswolf/pastebin/-/issues/2

Bert Karwatzki