Re: Possible PREEMPT_RT live-lock / priority-inversion between FUTEX_CMP_REQUEUE_PI and FUTEX_WAIT_REQUEUE_PI
From: Moritz KLAMMLER (FERCHAU)
Date: Mon Mar 23 2026 - 12:40:56 EST
Thanks for your quick and helpful response, Sebastian.
We have tried your patch, and it does indeed seem to solve the problem
(using the example program from my previous message as a test case). It
is also certainly more elegant than any of the other options we've
considered so far. Thank you very much. I'll report back if our system
tests find any unexpected regressions in the coming days.
>> we're running Linux 6.6 with PREEMPT_RT on a single-core armv7l machine
> v6.6.109+?
Yes, 6.6.122 to be precise.
I've also compared the logic with newer kernel versions, but couldn't
identify any differences that seemed significant to me with respect to
the code paths in question. I have to admit that I didn't actually
/run/ the test with any newer kernels, though.
> So the syscall that saw Q_REQUEUE_PI_IGNORE returned, and now a second
> requeue-PI is attempted?
I /think/ that it's already the first syscall seeing the
Q_REQUEUE_PI_IGNORE that gets locked up.
Please excuse some bad copy-pasting in my first message, where at least
once I wrote Q_REQUEUE_PI_DONE instead of Q_REQUEUE_PI_IGNORE. Sorry
for any confusion this might have caused.
________________________________________
From: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Sent: Monday, March 23, 2026 14:30
To: Moritz KLAMMLER (FERCHAU)
Cc: Thomas Gleixner; Peter Zijlstra; linux-rt-devel@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: Possible PREEMPT_RT live-lock / priority-inversion between FUTEX_CMP_REQUEUE_PI and FUTEX_WAIT_REQUEUE_PI
On 2026-03-20 19:23:01 [+0000], Moritz KLAMMLER (FERCHAU) wrote:
> Hello,
Hi,
> we're running Linux 6.6 with PREEMPT_RT on a single-core armv7l machine
v6.6.109+?
> and observed our devices getting locked-up every few days. We're using
> RT/PI condition variables from librtpi [1] and determined that the RT
> (SCHED_FIFO) thread making the FUTEX_CMP_REQUEUE_PI syscall from within
> pi_cond_broadcast seems to occasionally live-lock inside the kernel.
>
> Thanks to a possibly less than ideal design decision in our system, the
> "producer" thread calling pi_cond_broadcast (i.e. doing the
> FUTEX_CMP_REQUEUE_PI) has a higher priority than the "consumer" threads
> that are waiting on the condition variable (calling pi_cond_timedwait
> which eventually makes a FUTEX_WAIT_REQUEUE_PI call). While this might
> not be ideal, I suppose that it still ought to be allowed; please
> correct me if I should be mistaken on that point.
Not sure why not. Worst case would be that the producer would snap all
locks and see no waiter because the consumer never managed to enqueue.
> What seems to happen next is that when the waiter exceeds its finite
> timeout [2] and half an eye-blink later, the producer thread decides to
The alternative to timeout is signal.
> call FUTEX_CMP_REQUEUE_PI after all, the lower-priority consumer might
> make it to the point where it sets the requeue state to
> Q_REQUEUE_PI_DONE in futex_requeue_pi_wakeup_sync but then gets
> preempted before it has a chance to remove itself from the waiters list.
> Now, the higher-priority producer thread calls futex_requeue_pi_prepare
> which will return false because it sees the Q_REQUEUE_PI_IGNORE.
> Subsequently, futex_proxy_trylock_atomic will fail with -EAGAIN and
So the syscall that saw Q_REQUEUE_PI_IGNORE returned, and now a second
requeue-PI is attempted?
> futex_requeue "goto retry". Which effectively results in the
> higher-priority RT thread busy-waiting on the lower-priority thread
> forever. It will call cond_resched before the "goto retry" but since it
> is considered the most important task in the system, it doesn't seem to
> be scheduled away anymore.
Yup. Kind of obvious if you put it like this.
What about
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 7e43839ca7b05..ce02cc715c98d 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -307,8 +307,11 @@ futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
 		return -EINVAL;
 
 	/* Ensure that this does not race against an early wakeup */
-	if (!futex_requeue_pi_prepare(top_waiter, NULL))
+	if (!futex_requeue_pi_prepare(top_waiter, NULL)) {
+		plist_del(&top_waiter->list, &hb1->chain);
+		futex_hb_waiters_dec(hb1);
 		return -EAGAIN;
+	}
 
 	/*
 	 * Try to take the lock for top_waiter and set the FUTEX_WAITERS bit
@@ -709,8 +712,10 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 	 * We were woken prior to requeue by a timeout or a signal.
 	 * Unqueue the futex_q and determine which it was.
 	 */
-	plist_del(&q->list, &hb->chain);
-	futex_hb_waiters_dec(hb);
+	if (!plist_node_empty(&q->list)) {
+		plist_del(&q->list, &hb->chain);
+		futex_hb_waiters_dec(hb);
+	}
 
 	/* Handle spurious wakeups gracefully */
 	ret = -EWOULDBLOCK;
? It compiles and might work.
Sebastian