Re: [PATCH] softirq: WARN_ON !preemptible() not check softirq cnt in bh disable on RT

From: Sebastian Andrzej Siewior

Date: Thu Mar 12 2026 - 06:08:51 EST


On 2026-03-12 01:01:15 [+0800], Xin Zhao wrote:
> hi, Sebastian
Hi,

> As you said, the current implementation is good enough. :)
> If you think it’s appropriate to change it to (system_state != SYSTEM_BOOTING), you can make
> that change later when you get rid of CONFIG_PREEMPT_RT_NEEDS_BH_LOCK. :)

If I get rid of CONFIG_PREEMPT_RT_NEEDS_BH_LOCK then the
!CONFIG_PREEMPT_RT_NEEDS_BH_LOCK path becomes the only code path and the
code in question will vanish.

> > Funny story: I did a grep for the pattern you described and this s390
> > driver was the only thing that popped up.
>
> I'm actually curious why the users of _local_bh_enable, specifically those using the s390
> driver, haven't raised the issue that this interface cannot be used in RT-linux. Could it be
> that s390 users have never run on RT-linux?

This driver is very old and s390 does not support PREEMPT_RT. You can
grep for ARCH_SUPPORTS_RT to see who supports it.

> > > Since you also mentioned that later CONFIG_PREEMPT_RT_NEEDS_BH_LOCK will no longer be
> > > enabled, at that point, local_bh_disable almost loses its significance. I think it
> > > should either be removed or implemented as a no-op, as it no longer achieves our
> > > expected effect, and it would be better to save some instruction execution time.
> >
> > We can't nop it entirely. local_bh_disable() needs to remain an RCU read
> > section and it needs to ensure that the context does not wander off to
> > another CPU. Also we need to count the disable/enable because once we go
> > back to zero, we need to run callbacks which may have queued up.
>
> I did overlook that local_bh_disable() is also considered an RCU critical section and is
> used in conjunction with rcu_read_lock_bh(). Although I saw comments in the code like
> "/* Required to meet the RCU bottomhalf requirements. */", I don't fully understand why
> local_bh_disable must be treated as an RCU read critical section. Is it simply because the
> implementation of rcu_read_lock_bh does not directly call __rcu_read_lock and instead relies
> on local_bh_disable to proxy this call? I haven't figured this out, and it seems a bit
> strange to me.

local_bh_disable() is an implicit RCU read-side critical section on
!PREEMPT_RT and we must preserve that semantic.

> > And if we queue the softirq on a per-task basis rather than per-CPU then
> > we don't have the problem that one task completes softirqs queued by
> > another one.
>
> Are you suggesting that the future implementation of soft interrupts might be optimized to
> use a per-task approach for queuing and processing soft interrupts? I think this is a very
> good attempt, as the current handling of soft interrupts is a bit chaotic. High-priority
> tasks often end up passively dealing with many low-priority soft interrupt tasks during
> local_bh_disable(), effectively acting as 'ksoftirqd'. This seems unreasonable to me, as
> it elevates the priority of low-priority tasks for processing.

Yes. Getting rid of that BH lock removed much of the pain. This would be one additional piece.

> If soft interrupt handling could be implemented in a per-task manner, it could even lead to
> priority inheritance in the future, and possibly work in conjunction with BH workqueues to
> thoroughly resolve the long-standing issues of soft interrupts in RT-linux. In my project,
> performance problems are often related to __local_bh_disable_ip and various sporadic
> latency spikes caused by migrate_disable(). This is quite frustrating.

Ideally if task X queues soft interrupts, it handles them and a later
task does not observe them. Only a task with higher priority can add
additional softirq work.
If task X queues BLOCK and gets preempted, task Y with higher priority
adds NET_RX, then task Y will handle NET_RX and BLOCK. This can be
avoided by handling the softirqs per-task.
However, if both raise NET_RX then task Y will still handle both. This is
because both use the same data structure to queue work, in this case the
list of pending napi devices. Here, threaded napi would help because it
avoids the shared data structure.

I am not a big fan of the BH workqueues because you queue a work item in
the context in which it originates and then it "vanishes": all the
priority information is gone. Also, work from lower-priority tasks gets
mixed with work from high-priority tasks, which is not something you
desire in general.
You are usually better off remaining in the threaded interrupt and
completing the work there.

> Xin Zhao

Sebastian