Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
From: Andrea Righi
Date: Wed Apr 22 2026 - 06:50:41 EST
Hi Richard,
On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> When unregistering my self-written scx scheduler, the following panic
> occurs [1].
>
> The root cause is that the JIT page backing ops->quiescent() is freed
> before all callers of that function have stopped.
>
> The expected ordering during teardown is:
> bitmap_zero(sch->has_op) + synchronize_rcu()
> -> guarantees no CPU will ever call sch->ops.* again
> -> only THEN free the BPF struct_ops JIT page
>
> bpf_scx_unreg() is supposed to enforce this order, but after
> commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> irq_work"), disable_work is no longer queued directly, so
> kthread_flush_work() can be a no-op. The caller then drops the
> struct_ops map too early, and the JIT page is poisoned with
> AARCH64_BREAK_FAULT before disable_workfn ever executes.
>
> So a subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> as true and calls ops.quiescent(), which hits the poisoned page and
> triggers a BRK panic.
>
> Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> to be queued before waiting for it.
>
> Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> Signed-off-by: Richard Cheng <icheng@xxxxxxxxxx>
> ---
> [1]:
> [ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
> [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
> [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
> [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
> [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
> [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
> [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
> [ 230.116843] pc : 0xffff80009bc2c1f8
> [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
> [ 230.217749] Call trace:
> [ 230.228515] 0xffff80009bc2c1f8 (P)
> [ 230.232077] dequeue_task+0x84/0x188
> [ 230.235728] sched_change_begin+0x1dc/0x250
> [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
> [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
> [ 230.249701] ___migrate_enable+0x4c/0xa0
> [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
> [ 230.258246] process_one_work+0x184/0x540
> [ 230.262342] worker_thread+0x19c/0x348
> [ 230.266170] kthread+0x13c/0x150
> [ 230.269465] ret_from_fork+0x10/0x20
> [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> [ 230.287621] ---[ end trace 0000000000000000 ]---
> [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
>
> Best regards,
> Richard Cheng.
> ---
> kernel/sched/ext.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..065660382a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
> scx_disable(sch, SCX_EXIT_UNREG);
> + /*
> + * sch->disable_work might still not queued, causing kthread_flush_work()
> + * as a noop. Syncing the irq_work first is required to guarantee the
nit, maybe rephrase: sch->disable_work might not have been queued yet, causing
kthread_flush_work() to be a no-op.
> + * kthread work has been queued before waiting for it.
> + */
> + irq_work_sync(&sch->disable_irq_work);
> kthread_flush_work(&sch->disable_work);
> RCU_INIT_POINTER(ops->priv, NULL);
> kobject_put(&sch->kobj);
> --
> 2.43.0
>
I can't reproduce it locally, but from a logical perspective it makes sense.
Nice catch!
Reviewed-by: Andrea Righi <arighi@xxxxxxxxxx>
Thanks,
-Andrea