LTP: cfs_bandwidth01: Unable to handle kernel NULL pointer dereference

From: Yong Wang
Date: Thu Sep 14 2023 - 11:20:02 EST


Hello!
>Following kernel crash noticed on Linux stable-rc 6.5.3-rc1 on qemu-arm64 while
>running LTP sched tests cases.
>
>This is not always reproducible.
I also encountered this problem on Linux 5.10 in an arm64 environment.
The kernel log is as follows:
[ 2893.003795] ==================================================================
[ 2893.003822] BUG: KASAN: null-ptr-deref in pick_next_task_fair+0x130/0x4e0
[ 2893.003880] Read of size 8 at addr 0000000000000080 by task ksoftirqd/0/12
[ 2893.003901]
[ 2893.003914] CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: P O 5.10.59-rt52#1
[ 2893.003959] Call trace:
[ 2893.003968] dump_backtrace+0x0/0x2e8
[ 2893.004009] show_stack+0x18/0x28
[ 2893.004032] dump_stack+0x104/0x174
[ 2893.004067] kasan_report+0x1d0/0x258
[ 2893.004098] __asan_load8+0x94/0xd0
[ 2893.004126] pick_next_task_fair+0x130/0x4e0
[ 2893.004164] __schedule+0x220/0xbd0
[ 2893.004192] schedule+0xec/0x1a0
[ 2893.004216] smpboot_thread_fn+0x124/0x548
[ 2893.004246] kthread+0x24c/0x278
[ 2893.004277] ret_from_fork+0x10/0x34
[ 2893.004306] ==================================================================
[ 2893.004325] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[ 2893.152267] Mem abort info:
[ 2893.152639] ESR = 0x96000004
[ 2893.153045] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2893.153739] SET = 0, FnV = 0
[ 2893.154143] EA = 0, S1PTW = 0
[ 2893.154560] Data abort info:
[ 2893.154940] ISV = 0, ISS = 0x00000004
[ 2893.155443] CM = 0, WnR = 0
[ 2893.155838] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000188edb000

The crash site in pick_next_task_fair() corresponds to the following source code:
se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se); // se is NULL here!

I found that pick_next_entity() returns NULL, so the NULL pointer dereference occurs when members of se are accessed afterwards.
However, it is not clear under what circumstances pick_next_entity() returns NULL.
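
To illustrate why the fault address is 0x80 rather than 0: group_cfs_rq() reads se->my_q, so with se == NULL the load hits the offset of my_q inside the entity. The sketch below is userspace C with a hypothetical stand-in struct (sched_entity_sketch is not the real kernel definition, and the 0x80 padding is an assumption chosen to match the log above; the real offset depends on the kernel version and config). It only computes the address that would be loaded and never dereferences it.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-in for the kernel's struct cfs_rq. */
struct cfs_rq;

/*
 * Hypothetical stand-in for struct sched_entity, NOT the kernel layout.
 * The padding is an assumption that places my_q at offset 0x80, which
 * matches the faulting address reported by KASAN in the log.
 */
struct sched_entity_sketch {
	unsigned char other_fields[0x80];
	struct cfs_rq *my_q;	/* run queue owned by this group entity */
};

/* Compile-time check that the sketch has the assumed layout. */
_Static_assert(offsetof(struct sched_entity_sketch, my_q) == 0x80,
	       "my_q is expected at offset 0x80 in this sketch");

/*
 * Address that a load of se->my_q would touch. It is computed with
 * offsetof() and never dereferenced, so passing NULL is safe here.
 */
static uintptr_t my_q_load_address(const struct sched_entity_sketch *se)
{
	return (uintptr_t)se + offsetof(struct sched_entity_sketch, my_q);
}
```

Under these assumptions, my_q_load_address(NULL) evaluates to 0x80, which is consistent with the "Read of size 8 at addr 0000000000000080" line in the report.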

In addition, in my environment, the following commands often reproduce the problem:
stress-ng -c 8 --cpu-load 100 --sched fifo --sched-prio 1 --cpu-method pi -t 900 &
runltp -s cfs_bandwidth01

I hope this helps to solve the problem.
Thanks.