Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores

From: Crystal Wood

Date: Fri Mar 27 2026 - 17:20:27 EST


On Fri, 2026-03-27 at 20:36 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
>
> On Thu, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> > A revert alone is not an option as it would bring back [1] and [2]
> > for all LTS releases that did not receive [3].
>
> Florian, Crystal, thanks for the feedback.
>
> I understand the revert concern regarding the CFS throttle deadlock.
> However, I want to clarify that the noise regression on isolated cores
> is a separate issue from the deadlock fixed by [3], and it remains
> unfixed even on linux-next, which already has [3] merged.

Nobody's saying that [3] would fix your issue. They're saying that the
deadlock issue is the reason why simply reverting the epoll change is
not acceptable, at least on kernels without [3].

> I've done extensive testing across multiple kernels to identify the
> exact mechanism. Here are the results.
>
> Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
> which uses perf_event_open() + epoll on each monitored CPU, combined
> with /proc/interrupts delta measurement.

I recommend sticking with the kernel's osnoise (with or without rtla).

Besides the IPI issue, it doesn't look like eosnoise is being
maintained anymore, ever since osnoise went into the kernel.
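
If rtla isn't convenient, the in-kernel tracer can also be driven
directly through tracefs. A sketch, assuming tracefs is mounted at
/sys/kernel/tracing and using an example isolated-CPU list:

```shell
# Configuration sketch: run the in-kernel osnoise tracer on isolated CPUs
# only; no perf_event_open()/epoll userspace is involved in the measurement.
cd /sys/kernel/tracing
echo 1-16 > osnoise/cpus        # restrict the sampling threads to these CPUs
echo osnoise > current_tracer
echo 1 > tracing_on
sleep 120                       # e.g. a 120s window
echo 0 > tracing_on
head trace
```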

> Setup:
> - Hardware: x86_64, SMT/HT enabled (CPUs 0-63)
> - Boot: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
>   rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
> - Duration: 120s per test
>
> IRQ delta on isolated CPUs (representative CPU1, 120s sample):
>
>                  6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
>                  spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
> RES (IPI):       324,279     323,864     321,594          0            1
> LOC (timer):     50,827      53,995      59,793           125,791      125,791
> IWI (irq work):  359,590     357,289     357,798          588,245      588,245
>
> osnoise on isolated CPUs (per 950ms sample):
>
>                  6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
>                  spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
> MAX noise (ns):  ~57,000     ~57,000     ~57,000          ~9           ~140
> IRQ/sample:      ~7,280      ~7,030      ~7,020           ~1           ~961
> Thread/sample:   ~6,330      ~6,090      ~6,090           ~1           ~1
> Availability:    ~93.5%      ~93.5%      ~93.5%           ~100%        ~99.99%
>
> The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
> in 120 seconds with the spinlock, essentially zero with rwlock. That is
> ~2,680 reschedule IPIs per second hitting each isolated core.
>
> The mechanism: on PREEMPT_RT, spinlock_t becomes an rt_mutex-based
> sleeping lock. When the
> eBPF osnoise tool (or any BPF/perf tool using epoll) calls
> epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU,

I don't see BPF calls from the inner loop of osnoise_main(). There are
BPF hooks for various interruptions... I'm guessing there's a loop
where each hook causes an IPI that causes another BPF hook. I
wouldn't have expected a wakeup for every sample, but it seems like
that's the default specified by libbpf (eosnoise doesn't set
sample_period).

> ep_poll_callback()
> runs under ep->lock (now rt_mutex) in IRQ context. The rt_mutex PI
> mechanism sends reschedule IPIs to wake waiters, which hit isolated
> cores. With rwlock, read_lock() in ep_poll_callback() does not generate
> cross-CPU IPIs.

Because it doesn't need to block in the first place (unless there's a
writer).

> Note on the tool: the eBPF osnoise tracer itself creates epoll activity
> on all CPUs via perf_event_open() + epoll_ctl(). This is representative
> of real-world scenarios where any BPF/perf monitoring tool, or system
> services like systemd/journald using epoll, would trigger the same
> regression on isolated cores.

Using BPF to hook IRQ entry/exit isn't representative of real-world
scenarios. Assuming I'm right about the underlying cause, this is an
issue with eosnoise that the epoll change exacerbates.

> When using the kernel's built-in osnoise tracer (which does not use
> epoll), isolated cores show 1ns noise / 1 IRQ per sample on all kernels
> regardless of spinlock vs rwlock — confirming the noise source is
> specifically the epoll spinlock contention path.
>
> Key finding: the task-based CFS throttle series [3] (Aaron Lu, merged
> in 6.18/linux-next) does NOT fix this issue. The regression is identical
> on 6.12, 6.18, and linux-next 7.0-rc5 with the spinlock. Only reverting
> to rwlock eliminates it.
>
> To answer Crystal's question "when would you ever reach that path on an
> isolated CPU?" — the answer is: any tool or service that uses
> perf_event_open() + epoll across all CPUs (BPF tools, perf, monitoring
> agents) will trigger ep_poll_callback() on isolated CPUs. On RT with the
> spinlock, this generates ~2,680 reschedule IPIs/s per isolated core.

Keep in mind that if you use kernel services, you can't expect perfect
isolation, or to never block on a mutex or get a callback -- but this
eosnoise issue does not mean that any perf_event_open() + epoll user
will be getting thousands of IPIs per second.

> The eventpoll spinlock noise regression needs its own fix — perhaps
> a lockless path in ep_poll_callback() for the RT case, or

Again, if you mean the old lockless path, RT is exactly where we don't
want that. What would be the reason to do this *only* for RT?

> converting ep->lock to a raw_spinlock with trylock semantics to avoid
> the rt_mutex IPI overhead.

Among other problems (what happens if the trylock fails? why a trylock
in the first place?), you can't call wake_up() with a raw spinlock
held: the waitqueue it wakes takes its own non-raw spinlock.

-Crystal