Futex hash_bucket lock can break isolation and cause priority inversion on RT

From: Juri Lelli
Date: Tue Oct 08 2024 - 11:23:09 EST


Hello,

A report concerning latency-sensitive applications using futexes on a
PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
how futexes are implemented. The following is an attempt to describe
what I am seeing in traces, validate that the interpretation is
plausible, and possibly collect ideas on how to address the issue at
hand.

Simplifying what is actually a rather complicated setup composed of
non-realtime (i.e., background load mostly related to a container
orchestrator) and realtime tasks, we can consider the following
situation:

- Multiprocessor system running a PREEMPT_RT kernel
- Housekeeping CPUs (usually 2) running background tasks + “isolated”
  CPUs running latency-sensitive tasks (which may occasionally also
  need to run non-realtime activities)
- CPUs are isolated dynamically by using the nohz_full/rcu_nocbs options
  and affinity; no static scheduler isolation is used (i.e., no
  isolcpus=domain)
- Threaded IRQs, RCU-related kthreads, timers, etc. are configured with
  the highest priorities on the system (FIFO)
- Latency-sensitive application threads run at a FIFO priority below the
  set of tasks from the previous point
- The latency-sensitive application uses futexes, but they protect data
  shared only among tasks running on the isolated set of CPUs
- Tasks running on housekeeping CPUs also use futexes
- Futexes belonging to the above two sets of non-interacting tasks are
  distinct (see the note on hashing right after this list)
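Note, for context, that even though the two sets of futexes are
distinct, all futexes in the system hash into a single global array of
hash buckets, so two completely unrelated futex keys can collide on the
same bucket and end up serializing on the same hb->lock. Roughly (a
simplified sketch from memory of what kernel/futex/core.c does; details
vary across kernel versions):

  /*
   * All futexes, from every process, hash into one global table whose
   * size scales with the number of CPUs.  A collision between two
   * unrelated keys means their operations serialize on the same
   * hb->lock (a sleeping, PI-aware lock on PREEMPT_RT).
   */
  static struct futex_hash_bucket *futex_queues;
  static unsigned long futex_hashsize;   /* power of two */

  static struct futex_hash_bucket *futex_hash(union futex_key *key)
  {
          u32 hash = jhash2((u32 *)key,
                            offsetof(typeof(*key), both.offset) / 4,
                            key->both.offset);

          return &futex_queues[hash & (futex_hashsize - 1)];
  }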

Under these conditions the actual issue presents itself when:

- A background task on a housekeeping CPU enters the sys_futex syscall
  and locks a hb->lock (a PI-enabled mutex on RT)
- That background task gets preempted by a higher priority task (e.g.
  the NIC irq thread)
- A low-latency application task on an isolated CPU also enters
  sys_futex; its futex key hash-collides with the background task's hb,
  so it tries to grab the same hb->lock and, even though it boosts the
  background task, it still has to wait for the higher priority task
  (the NIC irq thread) to finish executing on the housekeeping CPU,
  eventually missing its deadline (see the sketch below)

Now, of course, we could avoid the issue by giving the latency-sensitive
application tasks a higher priority than anything running on the
housekeeping CPUs, but the fact that an implicit in-kernel link between
otherwise unrelated tasks can cause priority inversion is probably not
ideal? Thus this email.

Does this report make any sense? If it does, has this issue ever been
reported and possibly discussed? I guess it’s kind of a corner case, but
I wonder if anybody already has suggestions on how to tackle it from a
kernel perspective.

Thanks!
Juri