Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller

From: K Prateek Nayak
Date: Wed Apr 09 2025 - 05:29:40 EST


(+ Aaron)

Hello Jan,

On 4/9/2025 12:11 PM, Jan Kiszka wrote:
On 12.10.23 17:07, Valentin Schneider wrote:
Hi folks,

We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
we haven't been able to come up with a reproducer (yet), I don't see anything
upstream that would prevent them from happening.

The setup involves eventpoll, CFS bandwidth controller and timer
expiry, and the sequence looks as follows (time-ordered):

p_read (on CPUn, CFS with bandwidth controller active)
======

ep_poll_callback()
read_lock_irqsave()
...
try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched
due to having no more runtime
preempt_enable()
preempt_schedule() <- switch out due to p_read now being throttled

p_write
=======

ep_poll()
write_lock_irq() <- blocks due to having active readers (p_read)

ktimers/n
=========

timerfd_tmrproc()
`\
ep_poll_callback()
`\
read_lock_irqsave() <- blocks due to having active writer (p_write)


From this point we have a circular dependency:

p_read -> ktimers/n (to replenish runtime of p_read)
ktimers/n -> p_write (to let ktimers/n acquire the readlock)
p_write -> p_read (to let p_write acquire the writelock)
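
For reference, the locking shape involved looks roughly like this
(heavily simplified from fs/eventpoll.c; locals and most details
elided):

static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode,
                            int sync, void *key)
{
        struct eventpoll *ep = ...;     /* derivation elided */
        unsigned long flags;

        read_lock_irqsave(&ep->lock, flags);
        ...
        wake_up(&ep->wq);       /* try_to_wake_up() under the readlock */
        ...
        read_unlock_irqrestore(&ep->lock, flags);
        ...
}

static int ep_poll(struct eventpoll *ep, ...)
{
        ...
        write_lock_irq(&ep->lock);      /* blocks while readers are inside */
        ...
}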

IIUC reverting
286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation")
should unblock this as the ktimers/n thread wouldn't block, but then we're back
to having the indefinite starvation so I wouldn't necessarily call this a win.
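
(Conceptually, that commit makes a new reader skip the fast path once a
writer is queued on the underlying rtmutex; the shape below is
illustrative pseudocode with made-up helper names, not the literal
rwbase_rt.c code:)

static void rwbase_read_lock_sketch(struct rwbase_rt *rwb)
{
        /* Fast path only while no writer is waiting on the rtmutex. */
        if (!writer_is_queued(rwb) && reader_fastpath_trylock(rwb))
                return;

        /* Otherwise queue up behind the waiting writer; with the
         * commit applied, this is what blocks ktimers/n behind
         * p_write in the scenario above. */
        reader_slowpath(rwb);
}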

Two options I'm seeing:
- Prevent p_read from being preempted when it's doing the wakeups under the
readlock (icky)
- Prevent ktimers / ksoftirqd (*) from running the wakeups that have
ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a
kworker /should/ do (see the sketch after the footnote below).

(*) It's not just timerfd, I've also seen it via net::sock_def_readable -
it should be anything that's pollable.
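
A hypothetical sketch of that second option (the ep_deferred_* names
are made up for illustration and error handling is elided; it only
uses the stock workqueue API):

#include <linux/slab.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

struct ep_deferred_wake {
        struct work_struct work;
        wait_queue_head_t *whead;
};

static void ep_deferred_wake_fn(struct work_struct *work)
{
        struct ep_deferred_wake *dw =
                container_of(work, struct ep_deferred_wake, work);

        /* Runs in kworker context: blocking on the eventpoll rwlock
         * here no longer stalls ktimers/ksoftirqd. */
        wake_up(dw->whead);
        kfree(dw);
}

/* Called from timer/softirq context in place of the direct wakeup: */
static void ep_defer_wakeup(wait_queue_head_t *whead)
{
        struct ep_deferred_wake *dw = kmalloc(sizeof(*dw), GFP_ATOMIC);

        if (!dw)
                return;
        INIT_WORK(&dw->work, ep_deferred_wake_fn);
        dw->whead = whead;
        schedule_work(&dw->work);
}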

I'm still scratching my head on this, so any suggestions/comments welcome!


We have been hunting sporadic lock-ups on RT systems for quite some
time, first only in the field (sigh), now finally also in the lab.
Those have a fairly high overlap with what was described here. Our
baselines so far: 6.1-rt, Debian and vanilla. We are currently
preparing experiments with latest mainline.

Do the backtraces from these lockups show tasks (specifically ktimerd)
waiting on an rwsem? Throttle deferral helps if cfs bandwidth throttling
is the reason for the long delay / circular dependency. Is cfs bandwidth
throttling in use on the systems that run into these lockups?
Otherwise, your issue might be completely different.


While this thread remained silent afterwards, we have found [1][2][3],
which appear to be related. But does this mean we are still affected by
this RT bug, even in latest 6.15-rc1?

I'm pretty sure a bunch of locking-related code has been reworked to
accommodate PREEMPT_RT since v6.1. Many rwsem-based locking patterns
have been replaced with alternatives like RCU. The recently introduced
dl_server infrastructure also helps prevent starvation of fair tasks,
which can allow progress and prevent lockups. I would recommend
checking whether the most recent -rt release can still reproduce your
issue:
https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@xxxxxxxxxxxxx/
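
(As a generic illustration of that kind of conversion, not any specific
upstream commit: a read-mostly rwlock reader turned into an RCU reader
never owns a lock that a writer has to wait for. 'dev->cfg' and the
helpers are hypothetical:)

/* Reader side: cannot block a writer, cannot be blocked by one. */
rcu_read_lock();
cfg = rcu_dereference(dev->cfg);
consume(cfg);
rcu_read_unlock();

/* Writer side: publish a new copy, reclaim the old one only after
 * all pre-existing readers have finished. */
new = kmemdup(old, sizeof(*old), GFP_KERNEL);
new->field = val;
rcu_assign_pointer(dev->cfg, new);
synchronize_rcu();
kfree(old);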

Note: Aaron Lu is working on Valentin's approach of deferring cfs
throttling to the exit-to-user-mode boundary:
https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@xxxxxxxxxxxxx/
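
(The core idea of throttle deferral, very roughly: leave the throttled
task runnable while it is in the kernel and only dequeue it at the
return-to-user boundary, e.g. via task_work, so it cannot be switched
out while holding something like the eventpoll readlock. An
illustrative sketch with made-up names, not the actual series; the
cfs_throttle_work field and the dequeue helper are hypothetical:)

static void throttle_on_exit_to_user(struct callback_head *cb)
{
        /* Runs on the way back to user space: no kernel-side locks
         * are held here, so parking the task until its runtime is
         * replenished is safe. */
        dequeue_task_for_throttle(current);     /* hypothetical */
}

static void defer_cfs_throttle(struct task_struct *p)
{
        init_task_work(&p->cfs_throttle_work, throttle_on_exit_to_user);
        task_work_add(p, &p->cfs_throttle_work, TWA_RESUME);
}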

If you still run into lockups / long latencies on the latest -rt
release and your system uses cfs bandwidth controls, you can perhaps
try running with Valentin's or Aaron's series to check whether
throttle deferral helps your scenario.


Jan

[1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@xxxxxxx/

I'm mostly testing and reviewing Aaron's series now since per-task
throttling seems to be the way forward based on discussions in the
community.



--
Thanks and Regards,
Prateek