Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

From: Tejun Heo

Date: Sun May 10 2026 - 22:57:13 EST


Hello, Andrea.

I traced reload_loop with per-CPU ring probes around all @ops->priv
and scx_root assign/clear sites. The race is a stomp:

  T2 unreg(K)                              T1 reg(K)
  -----------                              ---------
  sch = ops->priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                           mutex_lock; state ok
                                           scx_alloc_and_add_sched:
                                             ops->priv = sch_a800
                                           scx_root = sch_a800; init=0
                                           state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops->priv, NULL)        <-- clobbers sch_a800
  kobject_put(sch_b800)

Reachable because the unreg waits on sch->helper while the next reg
runs on the global scx_enable_helper, and scx_enable_mutex is released
inside scx_root_disable() well before bpf_scx_unreg() reaches its
RCU_INIT_POINTER. My trace caught 11us between PRIV_SET sch_a800 and
the clobber; nothing bounds it.
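
For reference, the interleaving can be replayed deterministically in a
small userspace model. This is only a sketch: every model_* name is a
hypothetical stand-in rather than a kernel symbol, locking and RCU are
elided, and the two threads are serialized by hand in the order the
trace showed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler instance and the ops struct. */
struct model_sched { int refs; int enabled; };
struct model_ops { struct model_sched *priv; };

static struct model_sched *model_root;	/* models scx_root */

/* reg side: bind a new sched to ops, publish it as root, enable it. */
static void model_reg(struct model_ops *ops, struct model_sched *sch)
{
	sch->refs = 1;		/* base reference, to be put by unreg */
	ops->priv = sch;
	model_root = sch;
	sch->enabled = 1;
}

/* unreg, first half: sample priv, disable, clear root. */
static struct model_sched *model_unreg_begin(struct model_ops *ops)
{
	struct model_sched *sch = ops->priv;

	assert(sch != NULL);
	sch->enabled = 0;
	if (model_root == sch)
		model_root = NULL;
	return sch;
}

/* unreg, second half: unconditional clear of priv, then put the ref. */
static void model_unreg_finish(struct model_ops *ops, struct model_sched *sch)
{
	ops->priv = NULL;	/* clobbers whatever reg stored meanwhile */
	sch->refs--;
}

/* Returns 1 if the replay ends in the stuck state described above. */
static int model_replay_stomp(void)
{
	struct model_sched b800 = { 0, 0 }, a800 = { 0, 0 };
	struct model_ops ops = { NULL };
	struct model_sched *old;
	int stuck;

	model_root = NULL;
	model_reg(&ops, &b800);
	old = model_unreg_begin(&ops);	/* T2: disable done, "mutex" dropped */
	model_reg(&ops, &a800);		/* T1: slips into the window */
	model_unreg_finish(&ops, old);	/* T2: clears priv, puts b800 */

	/* a800 is enabled and is root, but nothing will ever put its ref. */
	stuck = ops.priv == NULL && model_root == &a800 &&
		a800.enabled && a800.refs == 1 && b800.refs == 0;
	model_root = NULL;		/* don't leave a dangling pointer */
	return stuck;
}
```

model_replay_stomp() returning 1 corresponds to the end state in the
trace: priv is NULL, the new sched is still enabled as root, and its
base reference has no owner left to drop it.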

The posted patch suppresses the deref but leaves the stomp. Each
stomp leaks one sch, because the "sch's base reference will be put by
bpf_scx_unreg()" contract assumes ops->priv still points at it. In
the case I caught, sch_a800 is already SCX_ENABLED with scx_root
pointing at it: the bpf_link is gone but state stays ENABLED, so all
future attaches fail with -EBUSY permanently.

Suggestion: make @ops->priv the lifecycle binding. In
scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
existing state check and still under scx_enable_mutex, refuse with
-EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
ordering.
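
Roughly what I have in mind, as a userspace sketch rather than a patch.
All sketch_* names are hypothetical, the state machine is reduced to
two states, and the caller is assumed to hold the equivalent of
scx_enable_mutex around sketch_enable().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel structures. */
struct sketch_sched { int unused; };
struct sketch_ops { struct sketch_sched *priv; };

enum sketch_state { SKETCH_DISABLED, SKETCH_ENABLED };
static enum sketch_state sketch_cur_state = SKETCH_DISABLED;

/* Enable path; assumed to run with the enable mutex held. */
static int sketch_enable(struct sketch_ops *ops, struct sketch_sched *sch)
{
	if (sketch_cur_state != SKETCH_DISABLED)
		return -EBUSY;		/* existing state check */
	if (ops->priv)
		return -EBUSY;		/* proposed: unreg hasn't finished yet */

	ops->priv = sch;
	sketch_cur_state = SKETCH_ENABLED;
	return 0;
}

/* Disable half of unreg: state drops to DISABLED, priv is left alone. */
static void sketch_disable(void)
{
	sketch_cur_state = SKETCH_DISABLED;
}

/* Tail of unreg: same ordering as today, clear priv last. */
static void sketch_unreg_finish(struct sketch_ops *ops)
{
	ops->priv = NULL;
}

/* Replays the same interleaving; returns 1 if the stomp is refused. */
static int sketch_replay(void)
{
	struct sketch_sched b800 = { 0 }, a800 = { 0 };
	struct sketch_ops ops = { NULL };
	int first, raced, retry;

	sketch_cur_state = SKETCH_DISABLED;
	first = sketch_enable(&ops, &b800);	/* initial attach */
	sketch_disable();			/* T2: window opens */
	raced = sketch_enable(&ops, &a800);	/* T1: refused, priv still set */
	sketch_unreg_finish(&ops);		/* T2: clears its own priv */
	retry = sketch_enable(&ops, &a800);	/* later attach succeeds */

	return first == 0 && raced == -EBUSY && retry == 0 &&
	       ops.priv == &a800;
}
```

In this model the reg that lands in the window sees the old priv and
backs off, so the unreg only ever clears the pointer it installed. The
cost is a transient -EBUSY for the racing attach instead of a permanent
one.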

One question: are there other paths that write or clear @ops->priv?
I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
RCU_INIT_POINTER(NULL) in bpf_scx_unreg().

Thanks.

--
tejun