Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller

From: K Prateek Nayak
Date: Mon Apr 14 2025 - 10:58:12 EST


Hello Jan,

On 4/9/2025 7:14 PM, Jan Kiszka wrote:
On 09.04.25 14:13, Aaron Lu wrote:
On Wed, Apr 09, 2025 at 02:59:18PM +0530, K Prateek Nayak wrote:
(+ Aaron)

Thank you Prateek for bringing me in.

Hello Jan,

On 4/9/2025 12:11 PM, Jan Kiszka wrote:
On 12.10.23 17:07, Valentin Schneider wrote:
Hi folks,

We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
we haven't been able to come out with a reproducer (yet), I don't see anything
upstream that would prevent them from happening.

The setup involves eventpoll, CFS bandwidth controller and timer
expiry, and the sequence looks as follows (time-ordered):

p_read (on CPUn, CFS with bandwidth controller active)
======

ep_poll_callback()
read_lock_irqsave()
...
try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched
due to having no more runtime
preempt_enable()
preempt_schedule() <- switch out due to p_read being now throttled

p_write
=======

ep_poll()
write_lock_irq() <- blocks due to having active readers (p_read)

ktimers/n
=========

timerfd_tmrproc()
`\
ep_poll_callback()
`\
read_lock_irqsave() <- blocks due to having active writer (p_write)


From this point we have a circular dependency:

p_read -> ktimers/n (to replenish runtime of p_read)
ktimers/n -> p_write (to let ktimers/n acquire the readlock)
p_write -> p_read (to let p_write acquire the writelock)

IIUC reverting
286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation")
should unblock this as the ktimers/n thread wouldn't block, but then we're back
to having the indefinite starvation so I wouldn't necessarily call this a win.

Two options I'm seeing:
- Prevent p_read from being preempted when it's doing the wakeups under the
readlock (icky)
- Prevent ktimers / ksoftirqd (*) from running the wakeups that have
ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a
kworker /should/ do.

(*) It's not just timerfd, I've also seen it via net::sock_def_readable -
it should be anything that's pollable.

I'm still scratching my head on this, so any suggestions/comments welcome!


We have been hunting sporadic lock-ups of RT systems for quite some time,
first only in the field (sigh), now finally also in the lab. Those have
a fairly high overlap with what was described here. Our baselines so
far: 6.1-rt, Debian and vanilla. We are currently preparing experiments
with latest mainline.

Do the backtraces from these lockups show tasks (specifically ktimerd)
waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling
becomes the reason for the long delay / circular dependency. Is cfs bandwidth
throttling being used on the systems that run into these lockups?
Otherwise, your issue might be completely different.

Agree.


While this thread remained silent afterwards, we have found [1][2][3] as
apparently related. But this means we still have this RT bug, even
in latest 6.15-rc1?

I'm pretty sure a bunch of locking-related stuff has been reworked to
accommodate PREEMPT_RT since v6.1. Many rwsem-based locking patterns
have been replaced with alternatives like RCU. The recently introduced
dl_server infrastructure also helps prevent starvation of fair tasks,
which can allow progress and prevent lockups. I would recommend
checking whether the most recent -rt release can still reproduce your
issue:
https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@xxxxxxxxxxxxx/

Note: Aaron Lu is working on Valentin's approach of deferring cfs
throttling to the exit-to-user-mode boundary:
https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@xxxxxxxxxxxxx/

If you still run into lockups / long latencies on the latest -rt
release and your system is using cfs bandwidth controls, you can
perhaps try running with Valentin's or Aaron's series to check if
throttle deferral helps your scenario.

I just sent out v2 :-)
https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@xxxxxxxxxxxxx/

Hi Jan,

If you want to give it a try, please try v2.


Thanks, we are updating our setup right now.

BTW, does anyone already have a test case that produces the lockup issue
with one or two simple programs and some hectic CFS bandwidth settings?

This is your cue to grab a brown paper bag, since what I'm about to paste
below is probably lifetime without parole in RT land, but I believe
it gets close to the scenario described by Valentin:

(Based on v6.15-rc1; I haven't tested this with Aaron's series yet)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..7ed0a4923ca2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6497,6 +6497,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	int count = 0;
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	pr_crit("sched_cfs_period_timer: Started on CPU%d\n", smp_processor_id());
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
@@ -6537,6 +6538,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	}
 	if (idle)
 		cfs_b->period_active = 0;
+	pr_crit("sched_cfs_period_timer: Finished on CPU%d\n", smp_processor_id());
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..d68b05963b88 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2470,6 +2470,79 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
 	return sizeof(mm->saved_auxv);
 }
 
+/* These variables will be used in dumb ways. */
+raw_spinlock_t dwdt_spin_lock;
+struct hrtimer dwtd_timer;
+DEFINE_RWLOCK(dwdt_lock);
+
+/* Should send ktimerd into a deadlock */
+static enum hrtimer_restart deadlock_timer(struct hrtimer *timer)
+{
+	pr_crit("deadlock_timer: Started on CPU%d\n", smp_processor_id());
+	/* Should hit rtlock slowpath after kthread writer. */
+	read_lock(&dwdt_lock);
+	read_unlock(&dwdt_lock);
+	pr_crit("deadlock_timer: Finished on CPU%d\n", smp_processor_id());
+	return HRTIMER_NORESTART;
+}
+
+/* kthread function to preempt fair thread and block on write lock. */
+static int grab_dumb_lock(void *data)
+{
+	pr_crit("RT kthread: Started on CPU%d\n", smp_processor_id());
+	write_lock_irq(&dwdt_lock);
+	write_unlock_irq(&dwdt_lock);
+	pr_crit("RT kthread: Finished on CPU%d\n", smp_processor_id());
+
+	return 0;
+}
+
+/* Try to send ktimerd into a deadlock. */
+static void dumb_ways_to_die(unsigned long loops)
+{
+	struct task_struct *kt;
+	unsigned long i;
+	int cpu;
+
+	migrate_disable();
+
+	cpu = smp_processor_id();
+	pr_crit("dumb_ways_to_die: Started on CPU%d with %lu loops\n", cpu, loops);
+
+	raw_spin_lock_init(&dwdt_spin_lock);
+	hrtimer_setup(&dwtd_timer, deadlock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	kt = kthread_create_on_cpu(&grab_dumb_lock, NULL, cpu, "dumb-thread");
+
+	read_lock_irq(&dwdt_lock);
+
+	/* Dummy lock; Disables preemption. */
+	raw_spin_lock(&dwdt_spin_lock);
+
+	pr_crit("dumb_ways_to_die: Queuing timer on CPU%d\n", cpu);
+	/* Start a timer that will run before the bandwidth timer. */
+	hrtimer_forward_now(&dwtd_timer, ns_to_ktime(10000));
+	hrtimer_start_expires(&dwtd_timer, HRTIMER_MODE_ABS_PINNED);
+
+	pr_crit("dumb_ways_to_die: Waking up RT kthread on CPU%d\n", cpu);
+	sched_set_fifo(kt);	/* Create a high priority thread. */
+	wake_up_process(kt);
+
+	/* Exhaust bandwidth of caller */
+	for (i = 0; i < loops; ++i)
+		cpu_relax();
+
+	/* Enable preemption; kt should preempt now. */
+	raw_spin_unlock(&dwdt_spin_lock);
+
+	/* Waste time just in case RT task has not preempted us. (very unlikely!) */
+	for (i = 0; i < loops; ++i)
+		cpu_relax();
+
+	read_unlock_irq(&dwdt_lock);
+	pr_crit("dumb_ways_to_die: Finished on CPU%d with %lu loops\n", cpu, loops);
+	migrate_enable();
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2483,6 +2556,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case 666:
+		dumb_ways_to_die(arg2);
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;
--

The above adds a prctl() to trigger the scenario whose flow I've
described with some inline comments. Patterns like the above are a crime,
but I've done it in the name of science. Steps to reproduce:

# mkdir /sys/fs/cgroup/CG0
# echo $$ > /sys/fs/cgroup/CG0/cgroup.procs
# echo "500000 1000000" > /sys/fs/cgroup/CG0/cpu.max

# dmesg | tail -n 2 # Find the CPU where bandwidth timer is running
[ 175.919325] sched_cfs_period_timer: Started on CPU214
[ 175.919330] sched_cfs_period_timer: Finished on CPU214

# taskset -c 214 perl -e 'syscall 157,666,50000000' # Pin perl to same CPU, 50M loops

Note: You have to pin the perl command to the same CPU as the timer for it
to run into stalls. It may take a couple of attempts. Also, please adjust
the number of cpu_relax() loops based on your setup. In my case, 50M
loops run long enough to exhaust the cfs bandwidth.
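
If perl is not handy, the equivalent call from C is just prctl() with
option 666 and the loop count as arg2. A minimal sketch (file name and
default loop count are my own choices, error handling omitted):

/* Caller for the debug prctl added above (option 666).
 * Build with gcc and run pinned to the bandwidth timer's CPU, e.g.
 * "taskset -c 214 ./dwtd 50000000".
 */
#include <stdlib.h>
#include <sys/prctl.h>

int main(int argc, char **argv)
{
	unsigned long loops = argc > 1 ? strtoul(argv[1], NULL, 0) : 50000000UL;

	/* Without the patch this is an unknown option and fails with -EINVAL;
	 * with it, the kernel runs dumb_ways_to_die(loops). */
	return prctl(666, loops, 0, 0, 0);
}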

With this I see:

sched_cfs_period_timer: Started on CPU214
sched_cfs_period_timer: Finished on CPU214
dumb_ways_to_die: Started on CPU214 with 50000000 loops
dumb_ways_to_die: Queuing timer on CPU214
dumb_ways_to_die: Waking up RT kthread on CPU214
RT kthread: Started on CPU214
deadlock_timer: Started on CPU214
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 208-223): P1975/3:b..l
rcu: (detected by 124, t=15002 jiffies, g=3201, q=138 ncpus=256)
task:ktimers/214 state:D stack:0 pid:1975 tgid:1975 ppid:2 task_flags:0x4208040 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x401/0x15a0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? update_rq_clock+0x7c/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? rt_mutex_setprio+0x1c2/0x480
schedule_rtlock+0x1e/0x40
rtlock_slowlock_locked+0x20e/0xc60
rt_read_lock+0x8f/0x190
? __pfx_deadlock_timer+0x10/0x10
deadlock_timer+0x28/0x50
__hrtimer_run_queues+0xfd/0x2e0
hrtimer_run_softirq+0x9d/0xf0
handle_softirqs.constprop.0+0xc1/0x2a0
? __pfx_smpboot_thread_fn+0x10/0x10
run_ktimerd+0x3e/0x80
smpboot_thread_fn+0xf3/0x220
kthread+0xff/0x210
? rt_spin_lock+0x3c/0xc0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>

After a while I also get hung-task splats for the rcub boost kthread (this one is from a separate trace):

INFO: task rcub/4:462 blocked for more than 120 seconds.
Not tainted 6.15.0-rc1-test+ #743
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rcub/4 state:D stack:0 pid:462 tgid:462 ppid:2 task_flags:0x208040 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x401/0x15a0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? rt_mutex_adjust_prio_chain+0xa5/0x7e0
rt_mutex_schedule+0x20/0x40
rt_mutex_slowlock_block.constprop.0+0x42/0x1e0
__rt_mutex_slowlock_locked.constprop.0+0xa7/0x210
rt_mutex_slowlock.constprop.0+0x4e/0xc0
rcu_boost_kthread+0xe3/0x320
? __pfx_rcu_boost_kthread+0x10/0x10
kthread+0xff/0x210
? rt_spin_lock+0x3c/0xc0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>

I left the program running for a good 5 minutes and it did not budge after the
splat.

Note: I could not reproduce the splat with a !PREEMPT_RT kernel
(CONFIG_PREEMPT=y) or with small loop counts that don't exhaust the
cfs bandwidth.
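
For reference, the userspace side of the pattern Valentin described boils
down to roughly the following: an epoll instance watching a timerfd plus an
eventfd, a bandwidth-throttled task signalling the eventfd (ep_poll_callback
then runs in that task's context and takes the ep->lock read side), and
another task sitting in epoll_wait() (which takes the write side), while the
timerfd expiry pokes ep_poll_callback from ktimers. This is an untested
sketch with names of my choosing; the preempt-while-holding-the-read-lock
window is narrow, which is why the patch above forces the sequence from the
kernel side instead:

/*
 * Untested sketch: timerfd + eventfd behind one epoll instance. The forked
 * child (meant to run inside the cpu.max-limited cgroup, pinned to the same
 * CPU as the period timer) keeps signalling the eventfd, which invokes
 * ep_poll_callback in its context; the parent blocks in epoll_wait(); the
 * timerfd expiry invokes ep_poll_callback from the timer softirq / ktimers.
 */
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>

int main(void)
{
	int epfd = epoll_create1(0);
	int evfd = eventfd(0, EFD_NONBLOCK);
	int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
	struct epoll_event ev = { .events = EPOLLIN };
	struct itimerspec its = {
		.it_interval = { .tv_nsec = 100 * 1000 },	/* 100us */
		.it_value    = { .tv_nsec = 100 * 1000 },
	};

	ev.data.fd = evfd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
	ev.data.fd = tfd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);
	timerfd_settime(tfd, 0, &its, NULL);

	if (fork() == 0) {
		/* Throttled writer: each write wakes the epoll waiter. */
		uint64_t one = 1;

		for (;;)
			write(evfd, &one, sizeof(one));
	}

	/* Waiter side of the race: ep_poll() via epoll_wait(). */
	for (;;) {
		struct epoll_event out[2];
		uint64_t buf;

		epoll_wait(epfd, out, 2, -1);
		read(evfd, &buf, sizeof(buf));
		read(tfd, &buf, sizeof(buf));
	}
}

The writer would go into the cpu.max-limited cgroup and both tasks would be
pinned to the CPU running the bandwidth period timer, same as with the perl
steps above.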


Jan


--
Thanks and Regards,
Prateek