Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores

From: Nam Cao

Date: Thu Apr 02 2026 - 00:43:13 EST


"Ionut Nechita (Wind River)" <ionut.nechita@xxxxxxxxxxxxx> writes:
> Crystal, Jan, Florian, thanks for the detailed feedback. I've redone
> all testing addressing each point raised. All tests below use HT
> disabled (sibling cores offlined), as Jan requested.
>
> Setup:
> - Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket,
> 32 cores, HT disabled via sibling cores offlined)
> - Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16
> rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31
> iommu=pt nmi_watchdog=0 intel_pstate=none skew_tick=1
> - eosnoise run with: ./osnoise -c 1-15
> - Duration: 120s per test
>
> Tested kernels (all vanilla, built from upstream sources):
> - 6.18.20-vanilla (non-RT, PREEMPT_DYNAMIC)
> - 6.18.20-vanilla (PREEMPT_RT, with and without rwlock revert)
> - 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert)
>
> I tested 6 configurations to isolate the exact failure mode:
>
> #  Kernel        Config  Tool            Revert  Result
> -- ------------  ------  --------------  ------  ----------------
> 1  6.18.20       non-RT  eosnoise        no      clean (100%)
> 2  6.18.20       RT      eosnoise        no      D state (hung)
> 3  6.18.20       RT      eosnoise        yes     clean (100%)
> 4  6.18.20       RT      kernel osnoise  no      clean (99.999%)
> 5  7.0-rc6-next  RT      eosnoise        no      93% avail, 57us
> 6  7.0-rc6-next  RT      eosnoise        yes     clean (99.99%)

Thanks for the detailed analysis.

> Key findings:
>
> 1. On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state.
>
> The process blocks in do_epoll_ctl() during perf_buffer__new() setup
> (libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
> degradation as fds are added to the epoll instance:
>
> CPU 0-13: epoll_ctl ~8 us (normal)
> CPU 14: epoll_ctl 16 ms (2000x slower)
> CPU 15: epoll_ctl 80 ms (10000x slower)
> CPU 16: epoll_ctl 80 ms
> CPU 17: epoll_ctl 20 ms
> CPU 18: epoll_ctl -- hung, never returns --
>
> Kernel stack of the hung process (3+ minutes in D state):
>
> [<0>] do_epoll_ctl+0xa57/0xf20
> [<0>] __x64_sys_epoll_ctl+0x5d/0xa0
> [<0>] do_syscall_64+0x7c/0xe30
> [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> 2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
> noise. The difference from 6.18 is likely additional fixes in
> linux-next that prevent the complete deadlock but not the contention.
>
> 3. Kernel osnoise tracer (test #4) shows zero noise on the same
> 6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
> issue is specifically in the epoll rt_mutex path, not in osnoise
> measurement methodology.
>
> Kernel osnoise output (6.18.20-rt, spinlock, no revert):
> 99.999% availability, 1-4 ns max noise, RES=6 total in 120s
>
> 4. Non-RT kernel (test #1) with the same spinlock change shows zero
> noise. This confirms the issue is the spinlock-to-rt_mutex conversion
> on PREEMPT_RT, not the spinlock change itself.
>
> IRQ deltas on isolated CPU1 (120s):
>
>                  6.18.20-rt  6.18.20-rt   6.18.20  6.18.20-rt
>                  spinlock    rwlock(rev)  non-RT   kernel osnoise
>  RES (IPI):      (D state)   3            1        6
>  LOC (timer):    (D state)   3,325        1,185    245
>  IWI (irq work): (D state)   565,988      1,433    121
>
>                  7.0-rc6-rt  7.0-rc6-rt
>                  spinlock    rwlock(rev)
>  RES (IPI):      330,000+    2
>  LOC (timer):    120,585     120,585
>  IWI (irq work): 585,785     585,785
>
> The mechanism, refined:
>
> Crystal was right that this is specific to the BPF perf_event_output +
> epoll pattern, not any arbitrary epoll user. I verified this: a plain
> perf_event_open + epoll_ctl program without BPF does not trigger the
> issue.
>
> What triggers it is libbpf's perf_buffer__new(), which creates one
> PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
> and adds all fds to a single epoll instance. When BPF programs are
> attached to high-frequency tracepoints (irq_handler_entry/exit,
> softirq_entry/exit, sched_switch), every interrupt on every CPU calls
> bpf_perf_event_output() which invokes ep_poll_callback() under
> ep->lock.
>
> On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
> callbacks simultaneously into the same epoll instance, the rt_mutex
> PI mechanism creates unbounded contention. On 6.18 this results in
> a permanent D state hang. On 7.0 it results in ~330,000 reschedule
> IPIs hitting isolated cores over 120 seconds (~2,750/s per core).
>
> With rwlock, ep_poll_callback() uses read_lock which allows concurrent
> readers without cross-CPU contention — the callbacks execute in
> parallel without generating IPIs.

These IPIs do not exist without eosnoise running; eosnoise introduces
this noise into the system itself. For a noise tracer, it is certainly
eosnoise's responsibility to make sure it does not measure noise
originating from itself.

> This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
> the standard architecture used by BCC tools (opensnoop, execsnoop,
> biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
> observability tool. A permanent D state hang when running such tools
> on PREEMPT_RT is a significant regression.

7.0-rc6-next still uses the spinlock but has no hang problem. You are
likely hitting a separate bug here, one that surfaces when the spinlock
is used and that was fixed somewhere between 6.18.20 and 7.0-rc6-next.

If you still have the energy for it, a git bisect between 6.18.20 and
7.0-rc6-next will tell us which commit made the hang issue disappear.

> I'm not proposing a specific fix -- the previous suggestions
> (raw_spinlock trylock, lockless path) were rightly rejected. But the
> regression exists and needs to be addressed. The ep->lock contention
> under high-frequency BPF callbacks on PREEMPT_RT is a new problem
> that the rwlock->spinlock conversion introduced.
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll. If the
> consensus is that the kernel side is working as intended and the tool
> should adapt, I'd like to understand what the recommended pattern is
> for BPF observability tools on PREEMPT_RT.

I am not familiar with eosnoise, so I can't tell you. I tried compiling
eosnoise but that failed; I managed to fix the compile failure, but then
hit a run-time failure.

It depends on what eosnoise is using epoll for. If it is just waiting
for PERF_COUNT_SW_BPF_OUTPUT to happen, perhaps we can change to some
sort of polling implementation (e.g. wake up every 100ms to check for
data).

Best regards,
Nam