Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
From: Tejun Heo
Date: Fri May 08 2026 - 11:28:37 EST
Hello,
On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
> setting sch->aborting and queuing the disable_work on the helper
> kthread.
>
> 2. The helper kthread (and other tasks) are stuck on the global or
> user DSQs because bypass mode hasn't been entered yet.
The helper thread runs in the RT class, so it doesn't go through SCX at all.
Can you try Andrea's patch?
> RFC:
> I guess this reintroduces the live-lock of a BPF scheduler having a
> highly contended DSQ with a lot of tasks and the outer loop holding
> dsq->lock and therefore it still taking too long for the bypass to
> activate, is there a better way?
> I also couldn't trigger a lockup through that, did I just not have
> the right platform (e.g. 2x Intel 8480c). Should we add a selftest
> for this too, then?
Dual Sapphire Rapids is where the problem was initially observed, and I could
also reproduce it on dual-socket Zen 2. SPRs are way more susceptible, though.
I *think* I was running scx_simple with some mixture of saturating stress-ng
workloads. It wasn't that difficult to reproduce. We should probably document
the repro somewhere. I'm not sure selftests is a good place to host this sort
of repro.
Thanks.
--
tejun