Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()

From: Richard Cheng

Date: Thu Apr 23 2026 - 23:30:49 EST

On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> Hi Richard,
>
> On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> > When unregistered my self-written scx scheduler, the following panic
> > occurs [1].
>
> Nit: you've placed the panic log [1] below the --- separator. Content
> below this line will not be preserved in the comment msg.
> Could you please move it into the commit msg body and send a v2 patch?
>

Hi Cheng-Yang,

Sure thing, thanks.

> >
> > The root cause is that the JIT page backing ops->quiescent() is freed
> > before all callers of that function have stopped.
> >
> > The expected ordering during teardown is:
> > bitmap_zero(sch->has_op) + synchronize_rcu()
> > -> guarantees no CPU will ever call sch->ops.* again
> > -> only THEN free the BPF struct_ops JIT page
> >
> > bpf_scx_unreg() is supposed to enforce the order, but after
> > commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> > irq_work"), disable_work is no longer queued directly, causing
> > kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> > map too early and poisoned with AARCH64_BREAK_FAULT before
> > disable_workfn ever execute.
> >
> > So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> > as true and calls ops.quiescent, which hit on the poisoned page and BRK
> > panic.
> >
> > Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> > to be queued before waiting for it.
> >
> > Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> > Signed-off-by: Richard Cheng <icheng@xxxxxxxxxx>
>
> Thanks for the fix, and the logic looks correct to me.
>
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@xxxxxxxxx>
>
> Also, scx_root_enable_workfn() has the same pattern in its error path:
>
> scx_error(sch, "scx_root_enable() failed (%d)", ret);
> kthread_flush_work(&sch->disable_work);
>
> The comment above indicates that this flush is meant to "ensure that
> error is reported before init completion". However, because scx_error()
> goes through scx_vexit() -> irq_work_queue(), the flush can be a no-op
> here as well. The same applies to the sub-scheduler enable error path.
>
> Should those be fixed in the same patch? Tejun, Andrea, wdyt? Thanks
>

Sure I'll amend that part in v2.

> [...]
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 012ca8bd70fb..065660382a0c 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> > struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
> >
> > scx_disable(sch, SCX_EXIT_UNREG);
> > + /*
> > + * sch->disable_work might still not queued, causing kthread_flush_work()
> > + * as a noop. Syncing the irq_work first is required to guarantee the
>
> Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^
>
> > + * kthread work has been queued before waiting for it.
> > + */
> > + irq_work_sync(&sch->disable_irq_work);
> > kthread_flush_work(&sch->disable_work);
> > RCU_INIT_POINTER(ops->priv, NULL);
> > kobject_put(&sch->kobj);
> > --
> > 2.43.0
> >
> >
>
> --
> Cheers,
> Cheng-Yang

- Richard