Re: [PATCH 00/24] Complete EEVDF

From: Hongyan Xia
Date: Wed Aug 21 2024 - 07:14:06 EST


On 20/08/2024 17:43, Hongyan Xia wrote:
Hi Peter,

On 27/07/2024 11:27, Peter Zijlstra wrote:
Hi all,

So after much delay this is hopefully the final version of the EEVDF patches.
They've been sitting in my git tree for ever it seems, and people have been
testing it and sending fixes.

I've spent the last two days testing and fixing cfs-bandwidth, and as far
as I know that was the very last issue holding it back.

These patches apply on top of queue.git sched/dl-server, which I plan on merging
in tip/sched/core once -rc1 drops.

I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.


Aside from a ton of bug fixes -- thanks all! -- new in this version is:

  - split up the huge delay-dequeue patch
  - tested/fixed cfs-bandwidth
  - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
  - SCHED_BATCH is equivalent to RESPECT_SLICE
  - propagate min_slice up cgroups
  - CLOCK_THREAD_DVFS_ID


The latest tip/sched/core at commit

aef6987d89544d63a47753cf3741cabff0b5574c

crashes very early in boot on my Juno r2 board (arm64). The trace is here:

[    0.049599] ------------[ cut here ]------------
[    0.054279] kernel BUG at kernel/sched/deadline.c:63!
[    0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[    0.066285] Modules linked in:
[    0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc1-g55404cef33db #1070
[    0.077855] Hardware name: ARM Juno development board (r2) (DT)
[    0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.090919] pc : enqueue_dl_entity+0x53c/0x540
[    0.095434] lr : dl_server_start+0xb8/0x10c
[    0.099679] sp : ffffffc081ca3c30
[    0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27: 0000000000000002
[    0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24: 0000000000000001
[    0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21: 0000000000000000
[    0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18: ffffffc080d0ad00
[    0.132015] x17: 000000000000002f x16: 0000000000000000 x15: ffffffc081ca8000
[    0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12: 0000000000000001
[    0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 : 0000000000000030
[    0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 : 000000002cf93d25
[    0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 : 0000000000000001
[    0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 : ffffff897ef21040
[    0.175483] Call trace:
[    0.177958]  enqueue_dl_entity+0x53c/0x540
[    0.182117]  dl_server_start+0xb8/0x10c
[    0.186010]  enqueue_task_fair+0x5c8/0x6ac
[    0.190165]  enqueue_task+0x54/0x1e8
[    0.193793]  wake_up_new_task+0x250/0x39c
[    0.197862]  kernel_clone+0x140/0x2f0
[    0.201578]  user_mode_thread+0x4c/0x58
[    0.205468]  rest_init+0x24/0xd8
[    0.208743]  start_kernel+0x2bc/0x2fc
[    0.212460]  __primary_switched+0x80/0x88
[    0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000)
[    0.222711] ---[ end trace 0000000000000000 ]---
[    0.227391] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.234187] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

I'm not an expert in the DL server, so I have no idea where the problem could be. If you know where to look off the top of your head, so much the better. If not, I'll do some bisection later.


Okay, in case the trace I provided isn't clear enough, I traced the crash to a call chain like this:

dl_server_start()
  enqueue_dl_entity()
    update_stats_enqueue_dl()
      update_stats_enqueue_sleeper_dl()
        __schedstats_from_dl_se()
          dl_task_of() <---------- crash

If I disable CONFIG_SCHEDSTATS, it boots fine, and I wonder whether this is why other people are not seeing the crash. It is probably related to the DL refactoring rather than to EEVDF.
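
To illustrate what I think is going on, here is a minimal userspace sketch, not the kernel code. It assumes dl_task_of() refuses entities that belong to a DL server (which would match the BUG at deadline.c:63), since a server entity is not embedded in any task. All struct, field, and function names below are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a deadline entity. */
struct dl_entity {
	bool is_server;		/* true for a DL server: no owning task */
};

/* Hypothetical stand-in for a task that embeds a deadline entity. */
struct task {
	int pid;
	struct dl_entity dl;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Strict lookup: aborts on a server entity, modelling the observed BUG. */
struct task *dl_entity_task_strict(struct dl_entity *dl_se)
{
	assert(!dl_se->is_server);
	return container_of(dl_se, struct task, dl);
}

/* Tolerant lookup: returns NULL for a server entity instead of aborting. */
struct task *dl_entity_task(struct dl_entity *dl_se)
{
	if (dl_se->is_server)
		return NULL;
	return container_of(dl_se, struct task, dl);
}

/* A stats path can check this before touching any per-task statistics. */
bool can_account_task_stats(struct dl_entity *dl_se)
{
	return dl_entity_task(dl_se) != NULL;
}
```

In this sketch, a stats helper that calls the strict lookup unconditionally would abort as soon as a server entity is enqueued, whereas one that checks can_account_task_stats() first would skip it. Whether the real fix should look like this is for someone who knows the DL code to say.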