Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue

From: K Prateek Nayak

Date: Tue May 26 2026 - 05:22:49 EST

Hello Zhang,

On 5/26/2026 1:23 PM, Zhang Qiao wrote:
> Testing sched/flat branch on AMD EPYC 9654 (384 CPUs, 8 NUMA nodes)
> with a 2-level cgroup hierarchy and cfs_bandwidth quota enabled,
> hackbench triggers a divide-by-zero oops:
>
> [ 142.308571] divide error: 0000 [#1] SMP NOPTI
> [ 142.308582] RIP: 0010:task_tick_fair+0x19e/0x410
> [ 142.308601] Call Trace:
> [ 142.308604] <IRQ>
> [ 142.308607] scheduler_tick+0x6a/0x110
> [ 142.308609] update_process_times+0x6b/0x90
> [ 142.308611] tick_sched_handle+0x2a/0x70
> [ 142.308613] tick_sched_timer+0x57/0xb0

More of this trace would have been helpful.

>
> faddr2line confirms:
>
> task_tick_fair+0x19e/0x410:
> __calc_prop_weight at kernel/sched/fair.c:4085
> (inlined by) task_tick_fair at kernel/sched/fair.c:13576

Those line numbers don't match the latest sched/flat, but since you
mention this happens with throttling, I believe the tick is hitting
somewhere between the task being dequeued by throttle_cfs_rq_work()
and the CPU rescheduling and taking the task off the runqueue.

Dequeue from throttle is slightly special since it keeps the task on
the runqueue, but the sched entity goes off the cfs_rq, which changes
the hierarchical weights.

Can you check if this helps:

(Lightly tested with your reproducer)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8bae794f063..d96e5915fb3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14815,18 +14815,21 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct sched_entity *se = &curr->se;
-	unsigned long weight = NICE_0_LOAD;
-	struct cfs_rq *cfs_rq;

-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		entity_tick(cfs_rq, se, queued);
+	if (se->on_rq) {
+		unsigned long weight = NICE_0_LOAD;
+		struct cfs_rq *cfs_rq;

-		weight = __calc_prop_weight(cfs_rq, se, weight);
-	}
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			entity_tick(cfs_rq, se, queued);
+
+			weight = __calc_prop_weight(cfs_rq, se, weight);
+		}

-	se = &curr->se;
-	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+		se = &curr->se;
+		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+	}

 	if (queued)
 		return;
---

I don't think it makes much sense to reweight an entity that
has been dequeued. The enqueue at unthrottle will do it anyway.
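
To illustrate the idea, here is a small userspace model of the race.
It is only a sketch: the toy_* names and structures are made up for
this example, and the weight formula (scale by the entity's share of
its queue at each level) is my assumption of what a proportional
weight walk looks like, not a copy of __calc_prop_weight().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD (UINT64_C(1) << 20)

/* Hypothetical stand-ins, not the kernel structures. */
struct toy_cfs_rq {
	uint64_t load_weight;		/* sum of weights of queued entities */
};

struct toy_se {
	uint64_t weight;
	bool on_rq;
	struct toy_cfs_rq *cfs_rq;	/* queue this entity sits on */
	struct toy_se *parent;		/* group entity one level up, or NULL */
};

static void toy_enqueue(struct toy_se *se)
{
	se->cfs_rq->load_weight += se->weight;
	se->on_rq = true;
}

/* Models the throttle dequeue: the entity leaves its queue's load sum. */
static void toy_dequeue(struct toy_se *se)
{
	assert(se->on_rq);
	se->cfs_rq->load_weight -= se->weight;
	se->on_rq = false;
}

/*
 * Walk up the hierarchy scaling the weight by the entity's share at
 * each level. Returns false without walking if the entity is not
 * queued, mirroring the se->on_rq guard in the diff above. Without
 * that check, a queue whose load sum dropped to zero would be used
 * as the divisor.
 */
static bool toy_prop_weight(struct toy_se *se, uint64_t *out)
{
	uint64_t weight = TOY_NICE_0_LOAD;

	if (!se->on_rq)
		return false;

	for (; se; se = se->parent)
		weight = weight * se->weight / se->cfs_rq->load_weight;

	*out = weight;
	return true;
}
```

With two equal-weight tasks on a leaf queue under one group entity,
each task gets half of TOY_NICE_0_LOAD. Once both are dequeued the
leaf's load sum is zero, and the guard is what keeps the walk from
dividing by it.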

>
> ===========================================================
> Reproduction
> ===========================================================
>
> Kernel: sched/flat branch (54d493980e00 and later)
> Hardware: AMD EPYC 9654, 2S 384 logical CPUs
>
> # 2-level cgroup, quota = 50% of one period
> cgcreate -g cpu:/bw/l1/l2
> cgset -r cpu.cfs_quota_us=50000 /bw/l1/l2
> cgset -r cpu.cfs_period_us=100000 /bw/l1/l2
>
> # high task count amplifies the throttle→tick race window
> cgexec -g cpu:/bw/l1/l2 hackbench -g 48 -l 1000 -s 512 -T
>
> Typically crashes within 30 seconds on this machine. A single-CPU
> kernel or a very loose quota (e.g. 90%) is unlikely to trigger it
> because the race window is narrow.

This was helpful! I see:

[ 209.935597] Oops: divide error: 0000 [#1] SMP NOPTI
[ 209.941061] CPU: 329 UID: 0 PID: 8247 Comm: sched-messaging Not tainted 7.1.0-rc2-test+ #73 PREEMPT(full)
[ 209.951841] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS RTI100CC 03/28/2024
[ 209.961254] RIP: 0010:task_tick_fair+0x10d/0x850
[ 209.966420] Code: dc 00 00 00 4c 89 f7 e8 f1 52 ff ff 45 85 e4 0f 85 ba 00 00 00 49 8b 06 4d 8b b6 b8 00 00 00 48 0f af c3 4d 85 f6 74 19 31 d2 <49> f7 37 ba 02 00 00 00 48 89 d3 48 39 d0 48 0f 43 d8 e9 20 ff ff
[ 209.987382] RSP: 0018:ff581fd71e1fce58 EFLAGS: 00010046
[ 209.993216] RAX: 0000010000000000 RBX: 0000000000100000 RCX: ff295dbfa9ad8080
[ 210.001179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff295dbfa9ad8080
[ 210.009141] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000063eb
[ 210.017104] R10: 0000000000000000 R11: ff581fd71e1fcff8 R12: 0000000000000000
[ 210.025061] R13: ff295dbfa9ad8000 R14: ff295dc06c6eac00 R15: ff295dbfd9bc8600
[ 210.033027] FS: 00007faef8c8b640(0000) GS:ff295e7c4acca000(0000) knlGS:0000000000000000
[ 210.042060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 210.048474] CR2: 00007f9884292d30 CR3: 000000011aa26001 CR4: 0000000000f71ef0
[ 210.056430] PKRU: 55555554
[ 210.059448] Call Trace:
[ 210.062177] <IRQ>
[ 210.064426] sched_tick+0x94/0x250
[ 210.068229] update_process_times+0x99/0xc0
[ 210.072903] tick_nohz_handler+0x95/0x1a0
[ 210.077380] ? __pfx_tick_nohz_handler+0x10/0x10
[ 210.082534] __hrtimer_run_queues+0xfe/0x260
[ 210.087304] hrtimer_interrupt+0x122/0x1f0
[ 210.091880] __sysvec_apic_timer_interrupt+0x55/0x130
[ 210.097525] sysvec_apic_timer_interrupt+0x7a/0xb0
[ 210.102873] </IRQ>
[ 210.105203] <TASK>
[ 210.107542] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 210.113284] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
[ 210.119511] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d ec 20 fd 01 74 05 e9 c0 81 d4 fe e8 00 93 ec fe e9 b6 81
[ 210.140469] RSP: 0018:ff581fd74032fe88 EFLAGS: 00000206
[ 210.146308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 210.154271] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ff295dbfa9ad8d64
[ 210.162235] RBP: ff295dbfa9ad8000 R08: 0000000000000000 R09: 0000000000000000
[ 210.170196] R10: 0000000000000000 R11: 0000000000000000 R12: ff295dbfa9ad8d64
[ 210.178159] R13: ff581fd74032ff48 R14: ff295dbfa9ad8000 R15: 00fffffffffff000
[ 210.186139] task_work_run+0x5c/0x90
[ 210.190137] exit_to_user_mode_loop+0x16e/0x550
[ 210.195198] ? srso_alias_return_thunk+0x5/0xfbef5
[ 210.200552] ? ksys_read+0xc5/0xe0
[ 210.204352] do_syscall_64+0x26e/0x750
[ 210.208540] ? do_syscall_64+0xaa/0x750
[ 210.212823] ? srso_alias_return_thunk+0x5/0xfbef5
[ 210.218174] entry_SYSCALL_64_after_hwframe+0x76/0x7e
---

So the theory of throttle work causing this checks out.
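
The register dump seems to agree, assuming I am decoding the Code:
bytes correctly: the faulting instruction (49 f7 37) is a div on the
quadword at [r15], preceded by an imul of rax by rbx and a clear of
edx. RBX is 0x100000, which is NICE_0_LOAD on 64-bit, and RAX is
0x10000000000, so the numerator looks sane and it is the divisor
loaded from memory that must have been zero. A throwaway check of
that arithmetic (toy_* names are hypothetical):

```c
#include <stdint.h>

/* NICE_0_LOAD on 64-bit kernels is 1 << 20. */
#define TOY_NICE_0_LOAD (UINT64_C(1) << 20)

/* Numerator as the imul would have produced it: weight * entity weight. */
static uint64_t toy_numerator(uint64_t weight, uint64_t se_weight)
{
	return weight * se_weight;
}
```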

The suggested diff above solves the crash in my case but your
mileage may vary. Peter can comment if this is the right thing
to do or not :-)

--
Thanks and Regards,
Prateek