Re: sched: observed instability under stress in 6.12 and mainline

From: Hao Jia

Date: Mon Oct 13 2025 - 01:54:48 EST




On 2025/10/13 11:03, Jiping Ma wrote:
Hi,

I'd like to draw the attention of the scheduler maintainers to a number
of kernel bugzilla reports submitted by a colleague a couple of weeks ago:

6.12.18:
https://bugzilla.kernel.org/show_bug.cgi?id=220447
https://bugzilla.kernel.org/show_bug.cgi?id=220448

v6.16-rt3:
https://bugzilla.kernel.org/show_bug.cgi?id=220450
https://bugzilla.kernel.org/show_bug.cgi?id=220449

There seems to be something wrong with either the logic or the locking.
In one case this resulted in a NULL pointer dereference in
pick_next_entity(). In another case it resulted in
BUG_ON(!rq->nr_running) in dequeue_top_rt_rq() and
SCHED_WARN_ON(!se->on_rq) in update_entity_lag().

My colleague suggests that the NULL pointer dereference may be due to
pick_eevdf() returning NULL in pick_next_entity().

I did some digging and found that
https://gitlab.com/linux-kernel/stable/-/commit/86b37810 would not have
been included in 6.12.18, but the equivalent fix should have been in the
6.16 load.

We haven't yet bottomed out the root cause.

Any suggestions or assistance would be appreciated.

Thanks,
Chris



Maybe this patch can be useful for your problem.
https://lore.kernel.org/all/tencent_3177343A3163451463643E434C61911B4208@xxxxxx/

If I understand correctly, we may call dequeue_entity() twice in
rt_mutex_setprio()/__sched_setscheduler(). cfs_bandwidth may break the
state of p->on_rq and se->on_rq.
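
The hypothesis above can be illustrated with a minimal userspace toy model.
This is only a sketch, not kernel code: every toy_* name is hypothetical, and
it assumes (without having confirmed it against the source) that a throttle
path clears the entity flag while the task flag stays set, so that a later
priority change dequeues the same entity a second time.

```c
/* Toy userspace model, NOT kernel code. Every toy_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_entity {
    bool on_rq;              /* stands in for se->on_rq */
};

struct toy_task {
    bool on_rq;              /* stands in for p->on_rq */
    struct toy_entity se;
};

static int toy_warnings;     /* counts dequeues of an entity that is not queued */

static void toy_dequeue_entity(struct toy_entity *se)
{
    if (!se->on_rq) {
        /* Roughly where a !se->on_rq warning would fire in this model. */
        toy_warnings++;
        return;
    }
    se->on_rq = false;
}

/* Assumed throttle path: removes the entity but leaves the task flag set. */
static void toy_throttle(struct toy_task *p)
{
    toy_dequeue_entity(&p->se);
}

/* Assumed priority-change path: trusts only the task-level flag. */
static void toy_setprio(struct toy_task *p)
{
    if (p->on_rq) {
        toy_dequeue_entity(&p->se);
        p->on_rq = false;
    }
    /* A real path would change class/priority and requeue here. */
}

/* Returns the number of "dequeue while not queued" events. */
int toy_double_dequeue_demo(void)
{
    struct toy_task p = { .on_rq = true, .se = { .on_rq = true } };

    toy_warnings = 0;
    toy_throttle(&p);
    assert(p.on_rq && !p.se.on_rq);   /* the two flags now disagree */
    toy_setprio(&p);
    return toy_warnings;
}
```

Calling toy_double_dequeue_demo() from a small main() shows the second
dequeue being attempted on an entity that is no longer queued. The model says
nothing about where the real inconsistency comes from; it only shows why
checking p->on_rq alone is not enough once the two flags can diverge.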

Thank you very much!
https://lore.kernel.org/all/tencent_3177343A3163451463643E434C61911B4208@xxxxxx/ fixes the original panic
(https://bugzilla.kernel.org/show_bug.cgi?id=220447), but we now hit the other !se->on_rq WARNING. Do you know
whether a fix for that already exists?


Perhaps the following patch is more suitable for fixing the previous panic.

https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@xxxxxxx/


This issue has been resolved in the latest mainline kernel by refactoring cfs_bandwidth.

As Peter mentioned, we need to submit a separate fix patch for the stable branch.

https://lore.kernel.org/all/20250929103836.GK3419281@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Thanks,
Hao