Re: [peterz-queue:sched/prep] [sched/fair] 124c8f4374: WARNING:at_kernel/sched/sched.h:#update_load_avg

From: K Prateek Nayak
Date: Wed Jul 31 2024 - 02:17:14 EST


Hello Peter, Oliver,

P.S. I see that peterz-queue has been updated recently so I'm not sure
if these observations still hold but if they do ...

On 7/29/2024 11:44 AM, kernel test robot wrote:


Hello,

kernel test robot noticed "WARNING:at_kernel/sched/sched.h:#update_load_avg" on:

commit: 124c8f43740f3f1e3e2aeecd976bcf5ddd7bd435 ("sched/fair: Implement delayed dequeue")
https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git sched/prep

in testcase: boot

compiler: clang-18
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


we noticed the issue does not always happen, 60 times out of 150 runs as below.
but the parent keeps clean.

a1133ac6dc7c6bde 124c8f43740f3f1e3e2aeecd976
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:150 40% 60:150 dmesg.WARNING:at_kernel/sched/sched.h:#update_load_avg



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
| Closes: https://lore.kernel.org/oe-lkp/202407291328.761441a7-oliver.sang@xxxxxxxxx


[ 13.522527][ C0] ------------[ cut here ]------------
[ 13.523046][ C0] rq->clock_update_flags < RQCF_ACT_SKIP
[ 13.523055][ C0] WARNING: CPU: 0 PID: 158 at kernel/sched/sched.h:1550 update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
[ 13.524265][ C0] Modules linked in: drm fuse ip_tables
[ 13.524722][ C0] CPU: 0 UID: 0 PID: 158 Comm: openipmi Not tainted 6.10.0-12745-g124c8f43740f #1
[ 13.525464][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 13.526307][ C0] RIP: 0010:update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
[ 13.526793][ C0] Code: 00 4c 2b bb 88 01 00 00 40 f6 c5 02 0f 84 b7 fa ff ff e9 ca fa ff ff c6 05 b9 1b e4 01 01 48 c7 c7 34 74 d6 a6 e8 a1 c7 fa ff <0f> 0b e9 78 fa ff ff 65 8b 0d 8f 83 91 5a 89 c9 48 0f a3 0d e9 7b
All code
========
0: 00 4c 2b bb add %cl,-0x45(%rbx,%rbp,1)
4: 88 01 mov %al,(%rcx)
6: 00 00 add %al,(%rax)
8: 40 f6 c5 02 test $0x2,%bpl
c: 0f 84 b7 fa ff ff je 0xfffffffffffffac9
12: e9 ca fa ff ff jmpq 0xfffffffffffffae1
17: c6 05 b9 1b e4 01 01 movb $0x1,0x1e41bb9(%rip) # 0x1e41bd7
1e: 48 c7 c7 34 74 d6 a6 mov $0xffffffffa6d67434,%rdi
25: e8 a1 c7 fa ff callq 0xfffffffffffac7cb
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 78 fa ff ff jmpq 0xfffffffffffffaa9
31: 65 8b 0d 8f 83 91 5a mov %gs:0x5a91838f(%rip),%ecx # 0x5a9183c7
38: 89 c9 mov %ecx,%ecx
3a: 48 rex.W
3b: 0f .byte 0xf
3c: a3 .byte 0xa3
3d: 0d .byte 0xd
3e: e9 .byte 0xe9
3f: 7b .byte 0x7b

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 78 fa ff ff jmpq 0xfffffffffffffa7f
7: 65 8b 0d 8f 83 91 5a mov %gs:0x5a91838f(%rip),%ecx # 0x5a91839d
e: 89 c9 mov %ecx,%ecx
10: 48 rex.W
11: 0f .byte 0xf
12: a3 .byte 0xa3
13: 0d .byte 0xd
14: e9 .byte 0xe9
15: 7b .byte 0x7b
[ 13.528234][ C0] RSP: 0000:ffffbaef40003d90 EFLAGS: 00010046
[ 13.528723][ C0] RAX: b53834d2850b7b00 RBX: ffff9ef2efd34640 RCX: 0000000000000027
[ 13.529379][ C0] RDX: 0000000000000002 RSI: 00000000ffff7fff RDI: ffff9ef2efc20b88
[ 13.530029][ C0] RBP: 0000000000000001 R08: 0000000000007fff R09: ffffffffa7053610
[ 13.530685][ C0] R10: 0000000000017ffd R11: 0000000000000004 R12: ffff9ef2efd345c0
[ 13.531344][ C0] R13: ffff9ef2efd34640 R14: ffff9eefc0e26400 R15: 0000000000000000
[ 13.532001][ C0] FS: 0000000000000000(0000) GS:ffff9ef2efc00000(0063) knlGS:00000000f7f5f180
[ 13.532720][ C0] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 13.533242][ C0] CR2: 00000000577c51ac CR3: 0000000162e6c000 CR4: 00000000000406f0
[ 13.533891][ C0] Call Trace:
[ 13.534215][ C0] <IRQ>
[ 13.534504][ C0] ? __warn (kernel/panic.c:240 kernel/panic.c:735)
[ 13.534863][ C0] ? update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
[ 13.535278][ C0] ? report_bug (lib/bug.c:? lib/bug.c:219)
[ 13.535695][ C0] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 13.536079][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:260)
[ 13.536498][ C0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 13.536920][ C0] ? update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
[ 13.537334][ C0] dequeue_entity (kernel/sched/sched.h:817 kernel/sched/fair.c:5429)
[ 13.537729][ C0] dequeue_entities (kernel/sched/fair.c:6972)
[ 13.538130][ C0] unregister_fair_sched_group (kernel/sched/fair.c:379 kernel/sched/fair.c:13159)
[ 13.554647][ C0] ? __pfx_sched_unregister_group_rcu (kernel/sched/core.c:8759)
[ 13.555167][ C0] sched_unregister_group_rcu (kernel/sched/core.c:8707 kernel/sched/core.c:8761)
[ 13.555621][ C0] rcu_do_batch (arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2576)
[ 13.556008][ C0] rcu_core (kernel/rcu/tree.c:2845)
[ 13.556373][ C0] handle_softirqs (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:555)
[ 13.556775][ C0] __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
[ 13.557166][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1043 arch/x86/kernel/apic/apic.c:1043)

Looking at the call stack, any chance it is the following case that is
tripping this warning:

CPU0 CPU1
==== ====
wakeup_preempt(rq) {
resched_curr(rq); ------------------------> sysvec_reschedule_ipi() {
... scheduler_ipi();
if (task_on_rq_queued(rq->curr) && } /* sysvec_reschedule_ipi */
test_tsk_need_resched(rq->curr)) handle_softirqs() {
rq_clock_skip_update(rq); /* RCU_SOFTIRQ */
} /* wakeup_preempt(); */ rcu_core_si() {
unregister_fair_sched_group() {
/* On the rq that got the resched IPI */
if (se->sched_delayed) {
dequeue_entities() {
/* Trips RQCF_ACT_SKIP check within since waker called rq_clock_skip_update() */
} /* dequeue_entities(); */
} /* if() */
} /* unregister_fair_sched_group */
} /* rcu_core_si */
} /* handle_softirqs */
--

commit a64692a3afd8 ("sched: Cleanup/optimize clock updates") seems to
have added this condition to skip clock updates after an enqueue since
wakeup target will do a schedule() soon. Since delayed dequeue may has
entity to dequeue now, looking at rq_clock_pelt() in dequeue path will
trip this since PELT believes that the clock is stale. Could something
like the below work?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd4a6bf14828..c437b408d29b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13297,10 +13297,34 @@ void unregister_fair_sched_group(struct task_group *tg)
if (se->sched_delayed) {
guard(rq_lock_irqsave)(rq);
if (se->sched_delayed) {
+ /*
+ * We can reach here when processing RCU_SOFTIRQ on exit path from
+ * a reschedule IPI. wakeup_preempt() may have set RQCF_REQ_SKIP to
+ * skip a close clock update in schedule(), however, in presence of
+ * a delayed entity, this trips the check in rq_clock_pelt() which
+ * now believes the clock value is stale and needs updating. To
+ * prevent such situation, cancel any pending skip updates, and
+ * update the rq clock.
+ */
+ rq_clock_cancel_skipupdate(rq);
+
+ /*
+ * XXX: Will this trip WARN_DOUBLE_CLOCK? In which case, can
+ * rq_clock_cancel_skipupdate() be made to return a bool if
+ * RQCF_REQ_SKIP is set and we avoid this update?
+ */
update_rq_clock(rq);
+
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
+ /* Avoid updating the clock again if a schedule() is pending */
+ if (task_on_rq_queued(rq->curr) &&
+ test_tsk_need_resched(rq->curr))
+ rq_clock_skip_update(rq);
}
list_del_leaf_cfs_rq(cfs_rq);
+
+
}
remove_entity_load_avg(se);
}
--

I was not able to hit the SCHED_WARN_ON() reliably on my end when
launching a guest to confirm my hypothesis - I was able to hit it
approx. 1 in 10 run, but I've ran with the above changes 10s of
times and I've not hit the splat (yet!)

Note: Although the diff was updated for the latest
peterz-queue:sched/eevdf, all my testing was done on the previous
version on this branch based on v6.10.0 kernel.
--
Thanks and Regards,
Prateek

[ 13.557638][ C0] </IRQ>
[ 13.557920][ C0] <TASK>
[ 13.560463][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702)
[ 13.560957][ C0] RIP: 0010:irqentry_exit_to_user_mode (kernel/entry/common.c:? include/linux/entry-common.h:328 kernel/entry/common.c:231)
[ 13.561465][ C0] Code: 90 41 57 41 56 53 48 89 fb e8 63 90 1f ff 65 4c 8b 34 25 40 37 03 00 49 8b 3e f7 c7 0e 30 02 00 74 7c 49 89 ff e8 97 8f 1f ff <4c> 89 f8 a8 08 75 38 a9 00 10 00 00 75 40 a9 00 20 00 00 75 4b a9
All code
========
0: 90 nop
1: 41 57 push %r15
3: 41 56 push %r14
5: 53 push %rbx
6: 48 89 fb mov %rdi,%rbx
9: e8 63 90 1f ff callq 0xffffffffff1f9071
e: 65 4c 8b 34 25 40 37 mov %gs:0x33740,%r14
15: 03 00
17: 49 8b 3e mov (%r14),%rdi
1a: f7 c7 0e 30 02 00 test $0x2300e,%edi
20: 74 7c je 0x9e
22: 49 89 ff mov %rdi,%r15
25: e8 97 8f 1f ff callq 0xffffffffff1f8fc1
2a:* 4c 89 f8 mov %r15,%rax <-- trapping instruction
2d: a8 08 test $0x8,%al
2f: 75 38 jne 0x69
31: a9 00 10 00 00 test $0x1000,%eax
36: 75 40 jne 0x78
38: a9 00 20 00 00 test $0x2000,%eax
3d: 75 4b jne 0x8a
3f: a9 .byte 0xa9

Code starting with the faulting instruction
===========================================
0: 4c 89 f8 mov %r15,%rax
3: a8 08 test $0x8,%al
5: 75 38 jne 0x3f
7: a9 00 10 00 00 test $0x1000,%eax
c: 75 40 jne 0x4e
e: a9 00 20 00 00 test $0x2000,%eax
13: 75 4b jne 0x60
15: a9 .byte 0xa9
[ 13.562938][ C0] RSP: 0000:ffffbaef40613f38 EFLAGS: 00000202
[ 13.563444][ C0] RAX: ffff9ef0287dd080 RBX: ffffbaef40613f58 RCX: 000000000000080b
[ 13.564125][ C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000008
[ 13.564796][ C0] RBP: 0000000000000000 R08: 000000000000002b R09: 000000000000002b
[ 13.565456][ C0] R10: 0000000000000296 R11: 00000000ff8772ac R12: 0000000000000000
[ 13.566115][ C0] R13: 0000000000000000 R14: ffff9ef0287dd080 R15: 0000000020000008
[ 13.566782][ C0] ? irqentry_exit_to_user_mode (kernel/entry/common.c:? include/linux/entry-common.h:328 kernel/entry/common.c:231)
[ 13.567254][ C0] asm_sysvec_reschedule_ipi (arch/x86/include/asm/idtentry.h:707)
[ 13.567709][ C0] RIP: 0023:0x56652630
[ 13.568072][ C0] Code: ff ff 83 c4 10 85 c0 74 05 83 c4 08 5b c3 83 ec 0c 8d 83 8d 6b ff ff 50 e8 cd 7c ff ff 8d b4 26 00 00 00 00 8d b6 00 00 00 00 <55> 57 56 53 e8 e7 61 ff ff 81 c3 13 38 01 00 83 ec 0c 8b 44 24 20
All code
========
0: ff (bad)
1: ff 83 c4 10 85 c0 incl -0x3f7aef3c(%rbx)
7: 74 05 je 0xe
9: 83 c4 08 add $0x8,%esp
c: 5b pop %rbx
d: c3 retq
e: 83 ec 0c sub $0xc,%esp
11: 8d 83 8d 6b ff ff lea -0x9473(%rbx),%eax
17: 50 push %rax
18: e8 cd 7c ff ff callq 0xffffffffffff7cea
1d: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
24: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
2a:* 55 push %rbp <-- trapping instruction
2b: 57 push %rdi
2c: 56 push %rsi
2d: 53 push %rbx
2e: e8 e7 61 ff ff callq 0xffffffffffff621a
33: 81 c3 13 38 01 00 add $0x13813,%ebx
39: 83 ec 0c sub $0xc,%esp
3c: 8b 44 24 20 mov 0x20(%rsp),%eax

Code starting with the faulting instruction
===========================================
0: 55 push %rbp
1: 57 push %rdi
2: 56 push %rsi
3: 53 push %rbx
4: e8 e7 61 ff ff callq 0xffffffffffff61f0
9: 81 c3 13 38 01 00 add $0x13813,%ebx
f: 83 ec 0c sub $0xc,%esp
12: 8b 44 24 20 mov 0x20(%rsp),%eax


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240729/202407291328.761441a7-oliver.sang@xxxxxxxxx