Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue

From: Zhang Qiao

Date: Tue May 26 2026 - 05:36:23 EST

Hi Prateek,

On 2026/5/26 17:15, K Prateek Nayak wrote:
> Hello Zhang,
>
> On 5/26/2026 1:23 PM, Zhang Qiao wrote:
>> Testing sched/flat branch on AMD EPYC 9654 (384 CPUs, 8 NUMA nodes)
>> with a 2-level cgroup hierarchy and cfs_bandwidth quota enabled,
>> hackbench triggers a divide-by-zero oops:
>>
>> [ 142.308571] divide error: 0000 [#1] SMP NOPTI
>> [ 142.308582] RIP: 0010:task_tick_fair+0x19e/0x410
>> [ 142.308601] Call Trace:
>> [ 142.308604] <IRQ>
>> [ 142.308607] scheduler_tick+0x6a/0x110
>> [ 142.308609] update_process_times+0x6b/0x90
>> [ 142.308611] tick_sched_handle+0x2a/0x70
>> [ 142.308613] tick_sched_timer+0x57/0xb0
>
> More of this trace would have been helpful.
>
>>
>> faddr2line confirms:
>>
>> task_tick_fair+0x19e/0x410:
>> __calc_prop_weight at kernel/sched/fair.c:4085
>> (inlined by) task_tick_fair at kernel/sched/fair.c:13576
>
> Those line numbers don't match on the latest sched/flat but since you
> mention this happens with throttling, I believe it is tick hitting
> somewhere in between the task being dequeued by throttle_cfs_rq_work()
> and the CPU rescheduling and taking the task off the runqueue.
>

Sorry for the confusion about the line numbers. The mismatch came from
local debug code I had added on top of sched/flat, not from a
difference in the base tree.

> Dequeue from throttle is slightly special since it keeps the task on
> runqueue but the sched entity goes off the cfs_rq changing the
> hierarchical weights.
>
> Can you check if this helps:
>
> (Lightly tested with your reproducer)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8bae794f063..d96e5915fb3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -14815,18 +14815,21 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  {
>  	struct sched_entity *se = &curr->se;
> -	unsigned long weight = NICE_0_LOAD;
> -	struct cfs_rq *cfs_rq;
>
> -	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> -		entity_tick(cfs_rq, se, queued);
> +	if (se->on_rq) {
> +		unsigned long weight = NICE_0_LOAD;
> +		struct cfs_rq *cfs_rq;
>
> -		weight = __calc_prop_weight(cfs_rq, se, weight);
> -	}
> +		for_each_sched_entity(se) {
> +			cfs_rq = cfs_rq_of(se);
> +			entity_tick(cfs_rq, se, queued);
> +
> +			weight = __calc_prop_weight(cfs_rq, se, weight);
> +		}
>
> -	se = &curr->se;
> -	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +		se = &curr->se;
> +		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +	}
>

throttle_cfs_rq_work() sets se->on_rq = 0 while the task is still running
as rq->curr, so a tick that lands before the CPU reschedules must not try
to reweight the already-dequeued entity. The enqueue at unthrottle will
handle the reweight anyway.
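
For anyone following along, here is a minimal user-space sketch of why the
se->on_rq guard helps. It is not kernel code. All toy_* names are
hypothetical, and the formula in toy_prop_weight() is only my assumption
about what __calc_prop_weight() computes, since its body is not quoted in
this thread. The model also simplifies throttling down to "the entity's
weight leaves its cfs_rq and on_rq is cleared".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit scaled NICE_0_LOAD: 1024 << 10. */
#define TOY_NICE_0_LOAD (1ULL << 20)

/* Toy stand-ins for the kernel structures; the layout is hypothetical. */
struct toy_cfs_rq {
	uint64_t load_weight;		/* sum of the weights of queued entities */
};

struct toy_se {
	uint64_t weight;
	bool on_rq;
	struct toy_cfs_rq *cfs_rq;	/* queue this entity sits on */
	struct toy_se *parent;		/* NULL at the top of the hierarchy */
};

/* ASSUMPTION: proportional share = weight * se weight / cfs_rq weight. */
static uint64_t toy_prop_weight(const struct toy_cfs_rq *cfs_rq,
				const struct toy_se *se, uint64_t weight)
{
	return weight * se->weight / cfs_rq->load_weight;
}

/* Models the throttle dequeue: the entity leaves its cfs_rq while the
 * task may still be running as rq->curr. */
static void toy_throttle_dequeue(struct toy_se *se)
{
	se->cfs_rq->load_weight -= se->weight;
	se->on_rq = false;
}

/* Guarded tick: returns false and leaves *out untouched when the entity
 * has already been dequeued, so the division is never reached. */
static bool toy_tick_weight(struct toy_se *curr, uint64_t *out)
{
	uint64_t weight = TOY_NICE_0_LOAD;

	if (!curr->on_rq)
		return false;

	for (struct toy_se *se = curr; se; se = se->parent)
		weight = toy_prop_weight(se->cfs_rq, se, weight);

	*out = weight;
	return true;
}
```

Without the early return, a lone entity that has just been through
toy_throttle_dequeue() would divide by a load_weight of 0 in this model,
which is the shape of the oops reported here.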

I've tested your suggested diff on my AMD EPYC 9654 (384 CPUs, 8 NUMA
nodes) and it resolves the crash. The reproducer no longer triggers the
divide error after running for several minutes.

Tested-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>

Thanks,
Zhang Qiao


>  	if (queued)
>  		return;
> ---
>
> I don't think it makes too much sense to reweight an entity that
> has been dequeued. The enqueue at unthrottle will do it anyways.
>
>>
>> ===========================================================
>> Reproduction
>> ===========================================================
>>
>> Kernel: sched/flat branch (54d493980e00 and later)
>> Hardware: AMD EPYC 9654, 2S 384 logical CPUs
>>
>> # 2-level cgroup, quota = 50% of one period
>> cgcreate -g cpu:/bw/l1/l2
>> cgset -r cpu.cfs_quota_us=50000 /bw/l1/l2
>> cgset -r cpu.cfs_period_us=100000 /bw/l1/l2
>>
>> # high task count amplifies the throttle→tick race window
>> cgexec -g cpu:/bw/l1/l2 hackbench -g 48 -l 1000 -s 512 -T
>>
>> Typically crashes within 30 seconds on this machine. A single-CPU
>> kernel or a very loose quota (e.g. 90%) is unlikely to trigger it
>> because the race window is narrow.
>
> This was helpful! I see:
>
> [ 209.935597] Oops: divide error: 0000 [#1] SMP NOPTI
> [ 209.941061] CPU: 329 UID: 0 PID: 8247 Comm: sched-messaging Not tainted 7.1.0-rc2-test+ #73 PREEMPT(full)
> [ 209.951841] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS RTI100CC 03/28/2024
> [ 209.961254] RIP: 0010:task_tick_fair+0x10d/0x850
> [ 209.966420] Code: dc 00 00 00 4c 89 f7 e8 f1 52 ff ff 45 85 e4 0f 85 ba 00 00 00 49 8b 06 4d 8b b6 b8 00 00 00 48 0f af c3 4d 85 f6 74 19 31 d2 <49> f7 37 ba 02 00 00 00 48 89 d3 48 39 d0 48 0f 43 d8 e9 20 ff ff
> [ 209.987382] RSP: 0018:ff581fd71e1fce58 EFLAGS: 00010046
> [ 209.993216] RAX: 0000010000000000 RBX: 0000000000100000 RCX: ff295dbfa9ad8080
> [ 210.001179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff295dbfa9ad8080
> [ 210.009141] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000063eb
> [ 210.017104] R10: 0000000000000000 R11: ff581fd71e1fcff8 R12: 0000000000000000
> [ 210.025061] R13: ff295dbfa9ad8000 R14: ff295dc06c6eac00 R15: ff295dbfd9bc8600
> [ 210.033027] FS: 00007faef8c8b640(0000) GS:ff295e7c4acca000(0000) knlGS:0000000000000000
> [ 210.042060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 210.048474] CR2: 00007f9884292d30 CR3: 000000011aa26001 CR4: 0000000000f71ef0
> [ 210.056430] PKRU: 55555554
> [ 210.059448] Call Trace:
> [ 210.062177] <IRQ>
> [ 210.064426] sched_tick+0x94/0x250
> [ 210.068229] update_process_times+0x99/0xc0
> [ 210.072903] tick_nohz_handler+0x95/0x1a0
> [ 210.077380] ? __pfx_tick_nohz_handler+0x10/0x10
> [ 210.082534] __hrtimer_run_queues+0xfe/0x260
> [ 210.087304] hrtimer_interrupt+0x122/0x1f0
> [ 210.091880] __sysvec_apic_timer_interrupt+0x55/0x130
> [ 210.097525] sysvec_apic_timer_interrupt+0x7a/0xb0
> [ 210.102873] </IRQ>
> [ 210.105203] <TASK>
> [ 210.107542] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 210.113284] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
> [ 210.119511] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d ec 20 fd 01 74 05 e9 c0 81 d4 fe e8 00 93 ec fe e9 b6 81
> [ 210.140469] RSP: 0018:ff581fd74032fe88 EFLAGS: 00000206
> [ 210.146308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [ 210.154271] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ff295dbfa9ad8d64
> [ 210.162235] RBP: ff295dbfa9ad8000 R08: 0000000000000000 R09: 0000000000000000
> [ 210.170196] R10: 0000000000000000 R11: 0000000000000000 R12: ff295dbfa9ad8d64
> [ 210.178159] R13: ff581fd74032ff48 R14: ff295dbfa9ad8000 R15: 00fffffffffff000
> [ 210.186139] task_work_run+0x5c/0x90
> [ 210.190137] exit_to_user_mode_loop+0x16e/0x550
> [ 210.195198] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 210.200552] ? ksys_read+0xc5/0xe0
> [ 210.204352] do_syscall_64+0x26e/0x750
> [ 210.208540] ? do_syscall_64+0xaa/0x750
> [ 210.212823] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 210.218174] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ---
>
> So the theory of throttle work causing this checks out.
>
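
The register dump in your oops looks consistent with a zero divisor as
well. If I decode the Code: bytes correctly, "48 0f af c3" is
imul rax, rbx, "31 d2" is xor edx, edx, and the faulting "49 f7 37" is
div qword ptr [r15]. The small check below only replays that arithmetic
with the dumped values. The helper names are hypothetical, and which
kernel fields the registers actually held is my guess (presumably the
running weight and the entity's load weight).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the register dump in the oops above. */
static const uint64_t oops_rax = 0x0000010000000000ULL;
static const uint64_t oops_rbx = 0x0000000000100000ULL;
static const uint64_t oops_rdx = 0x0000000000000000ULL;

/* NICE_0_LOAD on a 64-bit kernel: 1024 scaled up by 10 more bits. */
#define NICE_0_LOAD_64 (1ULL << 20)

/* RAX holds the product left by "imul rax, rbx": two NICE_0_LOAD-sized
 * factors multiplied together. */
static bool dividend_is_two_nice0_factors(void)
{
	return oops_rbx == NICE_0_LOAD_64 &&
	       oops_rax == NICE_0_LOAD_64 * NICE_0_LOAD_64;
}

/*
 * "div" raises a divide error for a zero divisor or for a quotient that
 * overflows 64 bits. Overflow needs RDX >= divisor, so with RDX == 0 the
 * only remaining cause is a zero divisor.
 */
static bool fault_requires_zero_divisor(void)
{
	return oops_rdx == 0;
}
```

Both checks hold for the dumped values, so the memory operand at [r15]
must have been 0 at the time of the fault.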

> The suggested diff above solves the crash in my case but your
> mileage may vary. Peter can comment if this is the right thing
> to do or not :-)
>