Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores

From: Ionut Nechita (Wind River)

Date: Fri Mar 27 2026 - 14:38:18 EST


From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>

On Thu, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> A revert alone is not an option as it would bring back [1] and [2]
> for all LTS releases that did not receive [3].

Florian, Crystal, thanks for the feedback.

I understand the revert concern regarding the CFS throttle deadlock.
However, I want to clarify that the noise regression on isolated cores
is a separate issue from the deadlock fixed by [3]: it remains unfixed
even on linux-next, which already has [3] merged.

I've done extensive testing across multiple kernels to identify the
exact mechanism. Here are the results.

Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
which uses perf_event_open() + epoll on each monitored CPU, combined
with /proc/interrupts delta measurement.
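
To make the delta method concrete, here is a minimal sketch of it in
Python. This is not eosnoise's actual code, and the snapshot strings
below are synthetic stand-ins for two reads of /proc/interrupts; the
idea is just: snapshot twice, subtract per-IRQ counts for one CPU
column, divide by the window length.

```python
# Sketch of the /proc/interrupts delta method (illustrative only, not
# the eosnoise implementation). Real /proc/interrupts lines carry a
# textual description after the counts; slicing to ncpus handles that.

def parse_interrupts(snapshot):
    """Map IRQ label -> list of per-CPU counts from /proc/interrupts text."""
    lines = snapshot.strip().splitlines()
    ncpus = len(lines[0].split())          # header line: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        label = fields[0].rstrip(":")
        counts[label] = [int(x) for x in fields[1:1 + ncpus]]
    return counts

def irq_delta(snap_a, snap_b, cpu):
    """Per-IRQ count delta for one CPU column between two snapshots."""
    a, b = parse_interrupts(snap_a), parse_interrupts(snap_b)
    return {irq: b[irq][cpu] - a[irq][cpu] for irq in a if irq in b}

# Synthetic snapshots, 120 s apart:
before_snap = """\
       CPU0    CPU1
RES:   1000    2000
LOC:    500     600
"""
after_snap = """\
       CPU0    CPU1
RES:   1100  324279
LOC:    700   51427
"""

for irq, d in irq_delta(before_snap, after_snap, cpu=1).items():
    print(f"{irq}: {d} over 120s = {d / 120:.0f}/s")
```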

Setup:
- Hardware: x86_64, SMT/HT enabled (CPUs 0-63)
- Boot: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
- Duration: 120s per test

IRQ delta on isolated CPUs (representative CPU1, 120s sample):

                 6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
                 spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
RES (IPI):          324,279     323,864          321,594            0                1
LOC (timer):         50,827      53,995           59,793      125,791          125,791
IWI (irq work):     359,590     357,289          357,798      588,245          588,245

osnoise on isolated CPUs (per 950ms sample):

                 6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
                 spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
MAX noise (ns):     ~57,000     ~57,000          ~57,000           ~9             ~140
IRQ/sample:          ~7,280      ~7,030           ~7,020           ~1             ~961
Thread/sample:       ~6,330      ~6,090           ~6,090           ~1               ~1
Availability:        ~93.5%      ~93.5%           ~93.5%        ~100%          ~99.99%

The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
in 120 seconds with the spinlock, essentially zero with rwlock. That is
~2,680 reschedule IPIs per second hitting each isolated core.
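
The per-second figure is plain arithmetic on the RES row for the
spinlock columns above:

```python
# RES (reschedule IPI) totals over the 120 s window, spinlock kernels,
# taken from the table above.
res_counts = {
    "6.12.79-rt": 324_279,
    "6.18.20-rt": 323_864,
    "7.0-rc5-next-rt": 321_594,
}
for kernel, total in res_counts.items():
    print(f"{kernel}: {total / 120:.0f} RES IPIs/s")
```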

The mechanism: on PREEMPT_RT, spinlock_t is a sleeping lock backed by
rt_mutex. When the eBPF osnoise tool (or any BPF/perf tool using epoll)
calls epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU,
ep_poll_callback() runs under ep->lock (now rt_mutex-based) in IRQ
context. Waking the lock waiters through the rt_mutex/PI machinery
sends reschedule IPIs, which land on the isolated cores. With rwlock,
read_lock() in ep_poll_callback() does not generate cross-CPU IPIs.
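
For anyone who wants the userspace side of that path in miniature, the
following Linux-only Python sketch goes through the same syscalls; a
pipe stands in for the perf event fd that the tool actually registers:

```python
# Minimal userspace view of the wakeup path: when an event arrives on a
# registered fd, the kernel invokes ep_poll_callback() (taking ep->lock)
# on the CPU where the event fired. A pipe stands in for a perf event
# fd. Linux-only: select.epoll is not available on other platforms.
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)   # epoll_ctl(EPOLL_CTL_ADD, ...)

os.write(w, b"x")                # event fires -> ep_poll_callback() in the kernel
events = ep.poll(timeout=1)      # epoll_wait(); woken via the epoll wait queue
print(f"epoll reported {len(events)} event(s)")

ep.unregister(r)
ep.close()
os.close(r)
os.close(w)
```

On RT with the spinlock'd ep->lock, every such wakeup on a monitored
CPU can involve the rt_mutex wake path described above.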

Note on the tool: the eBPF osnoise tracer itself creates epoll activity
on all CPUs via perf_event_open() + epoll_ctl(). This is representative
of real-world scenarios where any BPF/perf monitoring tool, or system
services like systemd/journald using epoll, would trigger the same
regression on isolated cores.

When using the kernel's built-in osnoise tracer (which does not use
epoll), isolated cores show 1ns noise / 1 IRQ per sample on all kernels
regardless of spinlock vs rwlock — confirming the noise source is
specifically the epoll spinlock contention path.

Key finding: the task-based CFS throttle series [3] (Aaron Lu, merged
in 6.18/linux-next) does NOT fix this issue. The regression is identical
on 6.12, 6.18, and linux-next 7.0-rc5 with the spinlock. Only reverting
to rwlock eliminates it.

To answer Crystal's question "when would you ever reach that path on an
isolated CPU?" — the answer is: any tool or service that uses
perf_event_open() + epoll across all CPUs (BPF tools, perf, monitoring
agents) will trigger ep_poll_callback() on isolated CPUs. On RT with the
spinlock, this generates ~2,680 reschedule IPIs/s per isolated core.

The eventpoll spinlock noise regression needs its own fix — perhaps
a lockless path in ep_poll_callback() for the RT case, or
converting ep->lock to a raw_spinlock with trylock semantics to avoid
the rt_mutex IPI overhead.

Ionut