Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores

From: Ionut Nechita (Wind River)

Date: Wed Apr 01 2026 - 13:21:21 EST

From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>

Crystal, Jan, Florian, thanks for the detailed feedback. I've redone
all the testing to address each point raised. All tests below were run
with HT disabled (sibling cores offlined), as Jan requested.

Setup:
- Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket,
  32 cores, HT disabled by offlining sibling cores)
- Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16
  rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31
  iommu=pt nmi_watchdog=0 intel_pstate=none skew_tick=1
- eosnoise run with: ./osnoise -c 1-15
- Duration: 120s per test

Tested kernels (all vanilla, built from upstream sources):
- 6.18.20-vanilla (non-RT, PREEMPT_DYNAMIC)
- 6.18.20-vanilla (PREEMPT_RT, with and without rwlock revert)
- 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert)

I tested 6 configurations to isolate the exact failure mode:

 #  Kernel        Config  Tool            Revert  Result
--  ------------  ------  --------------  ------  ----------------
 1  6.18.20       non-RT  eosnoise        no      clean (100%)
 2  6.18.20       RT      eosnoise        no      D state (hung)
 3  6.18.20       RT      eosnoise        yes     clean (100%)
 4  6.18.20       RT      kernel osnoise  no      clean (99.999%)
 5  7.0-rc6-next  RT      eosnoise        no      93% avail, 57us
 6  7.0-rc6-next  RT      eosnoise        yes     clean (99.99%)

Key findings:

1. On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state.

The process blocks in do_epoll_ctl() during perf_buffer__new() setup
(libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
degradation as fds are added to the epoll instance:

CPU 0-13: epoll_ctl ~8 us (normal)
CPU 14: epoll_ctl 16 ms (2000x slower)
CPU 15: epoll_ctl 80 ms (10000x slower)
CPU 16: epoll_ctl 80 ms
CPU 17: epoll_ctl 20 ms
CPU 18: epoll_ctl -- hung, never returns --

Kernel stack of the hung process (3+ minutes in D state):

[<0>] do_epoll_ctl+0xa57/0xf20
[<0>] __x64_sys_epoll_ctl+0x5d/0xa0
[<0>] do_syscall_64+0x7c/0xe30
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
noise. The difference from 6.18 is likely additional fixes in
linux-next that prevent the complete deadlock but not the contention.

3. The kernel osnoise tracer (test #4) shows essentially no noise on the
same 6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
issue is specifically in the epoll rt_mutex path, not in the osnoise
measurement methodology.

Kernel osnoise output (6.18.20-rt, spinlock, no revert):
99.999% availability, 1-4 ns max noise, RES=6 total in 120s

4. Non-RT kernel (test #1) with the same spinlock change shows zero
noise. This confirms the issue is the spinlock-to-rt_mutex conversion
on PREEMPT_RT, not the spinlock change itself.
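
The per-call epoll_ctl latencies in finding 1 came from strace. For
anyone who wants the same measurement from inside a test program, here
is a minimal sketch. It uses eventfds as stand-ins for the per-CPU
perf_event fds (an assumption, so it will not reproduce the hang by
itself), and time_epoll_adds() and MAX_FDS are hypothetical names:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <time.h>
#include <unistd.h>

#define MAX_FDS 64  /* hypothetical cap, only to keep the sketch simple */

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Add up to n fresh eventfds to one epoll instance and record how long
 * each EPOLL_CTL_ADD call takes in lat_ns[i]. Returns the number of fds
 * added, or -1 if the epoll instance could not be created.
 */
int time_epoll_adds(int n, uint64_t *lat_ns)
{
    int fds[MAX_FDS];
    int added = 0;
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd < 0)
        return -1;

    for (int i = 0; i < n && i < MAX_FDS; i++) {
        int fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
        if (fd < 0)
            break;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        uint64_t t0 = now_ns();
        int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        lat_ns[i] = now_ns() - t0;

        if (rc < 0) {
            close(fd);
            break;
        }
        fds[added++] = fd;
    }

    for (int i = 0; i < added; i++)
        close(fds[i]);
    close(epfd);
    return added;
}
```

Swapping the eventfds for the real perf_event fds, with the BPF programs
attached, would be needed to see the degradation shown above.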

IRQ deltas on isolated CPU1 (120s):

                   6.18.20-rt   6.18.20-rt    6.18.20    6.18.20-rt
                   spinlock     rwlock(rev)   non-RT     kernel osnoise
  RES (IPI):       (D state)    3             1          6
  LOC (timer):     (D state)    3,325         1,185      245
  IWI (irq work):  (D state)    565,988       1,433      121

                   7.0-rc6-rt   7.0-rc6-rt
                   spinlock     rwlock(rev)
  RES (IPI):       330,000+     2
  LOC (timer):     120,585      120,585
  IWI (irq work):  585,785      585,785
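
For reference, a per-CPU delta like the ones above can be computed from
two snapshots of /proc/interrupts taken before and after a run. The
sketch below assumes that collection method (the numbers above are not
tied to it), and irq_count() and irq_delta() are hypothetical helper
names. Note that `col` is the column position in the header line, which
only equals the CPU number when no lower-numbered CPU is offline:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the counter in column `col` (0-based, header order) of the
 * /proc/interrupts row named `label` (e.g. "RES"), or -1 if the row or
 * the column is missing.
 */
long long irq_count(const char *text, const char *label, int col)
{
    size_t llen = strlen(label);

    for (const char *line = text; line && *line; ) {
        const char *p = line;

        while (*p == ' ' || *p == '\t')
            p++;
        if (strncmp(p, label, llen) == 0 && p[llen] == ':') {
            long long val = -1;

            p += llen + 1;
            for (int i = 0; i <= col; i++) {
                char *end;

                while (*p == ' ' || *p == '\t')
                    p++;
                if (!isdigit((unsigned char)*p))
                    return -1;  /* row has fewer columns than requested */
                val = strtoll(p, &end, 10);
                p = end;
            }
            return val;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Difference between two snapshots for one row and column, or -1. */
long long irq_delta(const char *before, const char *after,
                    const char *label, int col)
{
    long long a = irq_count(before, label, col);
    long long b = irq_count(after, label, col);

    return (a < 0 || b < 0) ? -1 : b - a;
}
```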

The mechanism, refined:

Crystal was right that this is specific to the BPF perf_event_output +
epoll pattern, not any arbitrary epoll user. I verified this: a plain
perf_event_open + epoll_ctl program without BPF does not trigger the
issue.

What triggers it is libbpf's perf_buffer__new(), which creates one
PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
and adds all fds to a single epoll instance. When BPF programs are
attached to high-frequency tracepoints (irq_handler_entry/exit,
softirq_entry/exit, sched_switch), every interrupt on every CPU calls
bpf_perf_event_output() which invokes ep_poll_callback() under
ep->lock.
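
The fd setup described above can be sketched roughly as follows. This
is not libbpf's code: it leaves out the ring buffer mmap, the
PERF_EVENT_IOC_ENABLE ioctl and the BPF map update, and the function
names are hypothetical. It only shows the shape of "one
PERF_COUNT_SW_BPF_OUTPUT event per CPU, all registered with a single
epoll instance". Opening the events normally needs CAP_PERFMON or a
permissive perf_event_paranoid setting:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one BPF-output software event bound to `cpu`, for any task. */
static int open_bpf_output_event(int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_BPF_OUTPUT;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.sample_period = 1;
    attr.wakeup_events = 1;  /* wake pollers on every sample */

    return (int)syscall(SYS_perf_event_open, &attr, -1 /* pid */, cpu,
                        -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
}

/*
 * Register one event fd per CPU in [0, ncpus) with the shared epoll
 * instance `epfd`. Stops at the first failure (e.g. missing privileges
 * or a nonexistent CPU). Returns how many fds were registered; the fds
 * are stored in fds_out and owned by the caller.
 */
int shared_epoll_setup(int epfd, int ncpus, int *fds_out)
{
    int n = 0;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        int fd = open_bpf_output_event(cpu);
        if (fd < 0)
            break;

        struct epoll_event ev = {
            .events = EPOLLIN,
            .data.u32 = (uint32_t)cpu,
        };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
            close(fd);
            break;
        }
        fds_out[n++] = fd;
    }
    return n;
}
```

Every wakeup from any of these events ends up in ep_poll_callback() on
the one struct eventpoll behind epfd, which is where the shared ep->lock
comes in.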

On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
callbacks simultaneously into the same epoll instance, the rt_mutex
PI mechanism creates unbounded contention. On 6.18 this results in
a permanent D state hang. On 7.0 it results in ~330,000 reschedule
IPIs hitting isolated cores over 120 seconds (~2,750/s per core).

With rwlock, ep_poll_callback() uses read_lock which allows concurrent
readers without cross-CPU contention -- the callbacks execute in
parallel without generating IPIs.

This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
the standard architecture used by BCC tools (opensnoop, execsnoop,
biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
observability tool. A permanent D state hang when running such tools
on PREEMPT_RT is a significant regression.

I'm not proposing a specific fix -- the previous suggestions
(raw_spinlock trylock, lockless path) were rightly rejected. But the
regression exists and needs to be addressed. The ep->lock contention
under high-frequency BPF callbacks on PREEMPT_RT is a new problem
that the rwlock->spinlock conversion introduced.

Separate question: could eosnoise itself be improved to avoid this
contention? For example, by using one epoll instance per CPU instead of
a single shared one, or by using a BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
instead of the per-cpu perf buffer which requires epoll. If the
consensus is that the kernel side is working as intended and the tool
should adapt, I'd like to understand what the recommended pattern is
for BPF observability tools on PREEMPT_RT.
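
To make the first option concrete, here is a minimal sketch of the
"one epoll instance per event source" layout. It uses eventfds in place
of the per-CPU perf_event fds (an assumption), the struct and function
names are hypothetical, and whether this layout actually avoids the
contention on PREEMPT_RT is untested:

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* One event source with its own private epoll instance. */
struct per_src {
    int epfd;
    int fd;
};

/* Create n sources, each registered with its own epoll. 0 on success. */
int per_src_setup(struct per_src *s, int n)
{
    for (int i = 0; i < n; i++) {
        s[i].epfd = epoll_create1(EPOLL_CLOEXEC);
        s[i].fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

        struct epoll_event ev = { .events = EPOLLIN, .data.u32 = (unsigned)i };
        if (s[i].epfd < 0 || s[i].fd < 0 ||
            epoll_ctl(s[i].epfd, EPOLL_CTL_ADD, s[i].fd, &ev) < 0) {
            /* Undo everything created so far, including this entry. */
            for (int j = 0; j <= i; j++) {
                if (s[j].fd >= 0)
                    close(s[j].fd);
                if (s[j].epfd >= 0)
                    close(s[j].epfd);
            }
            return -1;
        }
    }
    return 0;
}

/* Non-blocking poll of one source; returns the number of ready events. */
int per_src_poll(const struct per_src *s)
{
    struct epoll_event ev;

    return epoll_wait(s->epfd, &ev, 1, 0);
}

/* Close all fds created by per_src_setup(). */
void per_src_teardown(struct per_src *s, int n)
{
    for (int i = 0; i < n; i++) {
        close(s[i].fd);
        close(s[i].epfd);
    }
}
```

With this layout a wakeup on one source only touches that source's
struct eventpoll, so callbacks from different CPUs would not share an
ep->lock. The cost is that userspace then needs one waiter per instance
(for example a thread per CPU) instead of a single epoll_wait() loop.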

Ionut