Re: [PATCH 0/4] sched: Various reweight_entity() fixes

From: Shubhang Kaushik

Date: Sat Feb 14 2026 - 02:21:13 EST


Hi Peter,

On Fri, 30 Jan 2026, Peter Zijlstra wrote:

> Two issues related to reweight_entity() were raised; poking at all that got me
> these patches.
>
> They're in queue.git/sched/core and I spend most of yesterday staring at traces
> trying to find anything wrong. So far, so good.
>
> Please test.



I'm seeing a consistent NULL pointer dereference in pick_task_fair() when running hackbench on an Ampere Altra (80-core arm64). This happens after applying the complete patchset on the latest 6.19.0+ kernel with PREEMPT_DYNAMIC (full), CONFIG_SCHED_CLUSTER and NOHZ_FULL enabled.

The system triggers a level 2 translation fault because pick_eevdf() returns NULL despite the runqueue having active tasks (cfs_rq->nr_running > 0). When pick_task_fair(), called from pick_next_task_fair(), attempts to dereference this NULL pointer to access the task structure, the kernel Oopses at pick_task_fair+0x48/0x148.

pick_task_fair <- pick_eevdf() <- [active tasks]

The root cause is an underflow in reweight_entity():
se->vprot -= avruntime;

Under heavy load, the average runtime can sometimes be larger than the protection time. Because these are unsigned numbers, the result wraps around to a large value instead of becoming zero.

[ 1284.596683] Mem abort info:
[ 1284.596684] ESR = 0x0000000096000006
[ 1284.596685] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1284.596687] SET = 0, FnV = 0
[ 1284.596688] EA = 0, S1PTW = 0
[ 1284.596689] FSC = 0x06: level 2 translation fault
[ 1284.596690] Data abort info:
[ 1284.596690] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[ 1284.596692] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 1284.596693] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 1284.596694] user pgtable: 4k pages, 48-bit VAs, pgdp=00000801917ee000
[ 1284.596695] [0000000000000051] pgd=08000801917ef403, p4d=08000801917ef403, pud=08000801a6ed8403, pmd=0000000000000000
[ 1284.597314] Internal error: Oops: 0000000096000006 [#1] SMP
[ 1284.670270] Modules linked in: joydev rfkill mlx5_ib ib_uverbs ib_core sunrpc binfmt_misc mlx5_core acpi_ipmi ipmi_ssif ipmi_devintf cdc_ether mlxfw usbnet psample mii arm_spe_pmu tls ipmi_msghandler arm_cmn arm_dmc620_pmu vfat fat arm_dsu_pmu cppc_cpufreq acpi_tad loop nfnetlink zram xfs uas usb_storage ast ghash_ce sbsa_gwdt nvme i2c_algo_bit nvme_core xgene_hwmon i2c_dev fuse
[ 1284.703809] CPU: 76 UID: 1000 PID: 17906 Comm: hackbench_bin Tainted: G W 6.19.0+ #166 PREEMPT(full)
[ 1284.714492] Tainted: [W]=WARN
[ 1284.717447] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0014/Mt.Jade Motherboard, BIOS 2.10.20230517-1P (SCP: 2.10.20230517) 2023/05/17
[ 1284.730472] pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1284.737420] pc : pick_task_fair+0x48/0x148
[ 1284.741508] lr : pick_task_fair+0x48/0x148
[ 1284.745592] sp : ffff8000f92337d0
[ 1284.748894] x29: ffff8000f92337d0 x28: ffff080094f01f00 x27: ffff08012dc43300
[ 1284.756017] x26: ffff0801298189f0 x25: ffff63d900a14000 x24: 0000000000000001
[ 1284.763140] x23: ffff080129818000 x22: 0000000000000000 x21: ffff083e61642e80
[ 1284.770263] x20: ffff083e61643000 x19: ffff083e61643000 x18: 0000000000000000
[ 1284.777385] x17: 0000000000000000 x16: 0000000000000000 x15: ffff08011fc4f200
[ 1284.784508] x14: 000000000041fdf0 x13: 0000000000000004 x12: 0000000004aca7c0
[ 1284.791631] x11: ffffa4656247fa30 x10: ffffa46562818c48 x9 : ffffa4655e0620c8
[ 1284.798754] x8 : 0000000000000001 x7 : 0000000000000002 x6 : 0000000000000002
[ 1284.805877] x5 : 000000000000000b x4 : 00000000057ae40b x3 : fff73c190d1b5a9c
[ 1284.813000] x2 : 0000002f3f70fd54 x1 : 02eb323b0f008a9a x0 : 0000000000000000
[ 1284.820124] Call trace:
[ 1284.822558] pick_task_fair+0x48/0x148 (P)
[ 1284.826643] pick_next_task_fair+0x34/0x250
[ 1284.830815] __pick_next_task+0x4c/0x260
[ 1284.834727] pick_next_task+0x40/0xb68
[ 1284.838463] __schedule+0x184/0x790
[ 1284.841942] preempt_schedule_common+0x28/0x50
[ 1284.846374] dynamic_preempt_schedule+0x30/0x40
[ 1284.850893] kfree+0x2e8/0x448
[ 1284.853936] skb_free_head+0x54/0xc0
[ 1284.857501] skb_release_data+0x164/0x230
[ 1284.861499] consume_skb+0x78/0x1b0
[ 1284.864975] unix_stream_read_generic+0x818/0x950
[ 1284.869668] unix_stream_recvmsg+0xa4/0xc0
[ 1284.873753] sock_recvmsg+0x78/0xd0
[ 1284.877230] sock_read_iter+0xa4/0x118
[ 1284.880967] new_sync_read+0x18c/0x1b8
[ 1284.884704] vfs_read+0x1a4/0x200
[ 1284.888006] ksys_read+0xf4/0x118
[ 1284.891308] __arm64_sys_read+0x24/0x40
[ 1284.895132] invoke_syscall+0x6c/0x100
[ 1284.898869] el0_svc_common.constprop.0+0x48/0xf0
[ 1284.903561] do_el0_svc+0x24/0x38
[ 1284.906864] el0_svc+0x54/0x2f0
[ 1284.909993] el0t_64_sync_handler+0xa0/0xe8
[ 1284.914164] el0t_64_sync+0x19c/0x1a0
[ 1284.917815] Code: d503201f 52800021 aa1303e0 97ffab6f (39414401)

I was able to fix this by preventing the underflow in reweight_entity() and resetting the state in set_next_entity().

By using time_after64(), I cap the value at 0 instead of letting it wrap to a large number. I also added a reset (se->vprot = se->vruntime) to ensure the protection state stays synchronized with actual progress, clearing out any mismatch from the reweighting phase.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3972,7 +3972,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		se->deadline -= avruntime;
 		se->rel_deadline = 1;
 		if (curr && protect_slice(se)) {
-			se->vprot -= avruntime;
+			se->vprot = (time_after64(se->vprot, avruntime)) ? se->vprot - avruntime : 0;
 			rel_vprot = true;
 		}

@@ -5608,6 +5608,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)

 		if (first)
 			set_protect_slice(cfs_rq, se);
+		else
+			se->vprot = se->vruntime;
 	}

Tested-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>

Regards,
Shubhang Kaushik