Re: [PATCH v2 1/1] futex: Add FUTEX_SPIN operation

From: Mateusz Guzik
Date: Sat May 25 2024 - 05:37:02 EST


On Thu, May 23, 2024 at 05:07:04PM -0300, André Almeida wrote:
> Add a new mode for futex wait, the futex spin.
>

I'll note that currently individuals who want to reduce performance
degradation due to contention in userspace either handroll some
speculative spinning + futex usage or straight up only spin, which btw
is a massive pet peeve of mine.

With this in mind the real reference point justifying this mechanism (or
showing it not being good enough to warrant inclusion) is a case where
all participating threads are on cpu and spinning is done in userspace
(in a sensible manner). Of course there is more than one way to spin,
but that is a minor point in the grand scheme of things.

The benchmark you linked in the cover letter definitely would use some
improvement, it was mentioned elsewhere that some of the time is spent
just waiting for threads to be created. The usual way to handle this is
by parking everyone on a barrier before the measurement starts. You can
use will-it-scale as an example how to do it, or better yet plug your
stuff into it as testcases instead. It even already has some contested
pthread mutex benchmarks.

So I would say proper test matrix would include pthread mutexes as is,
FUTEX2_SPIN and mere spinning (just load + cpu_relax, I can ship you
with sample code for will-it-scale if you want).

Even the best possible variant of FUTEX2_SPIN has to do worse than mere
spinning in userspace -- in a HT setup it hinders execution by coming
here instead of just cpu_relax-ing, and that aside it may be introducing
dead time where the lock is free to use by the cpu is busy going in and
out of the kernel. The question is if it is going to do well enough to
persuade people to not bother with their own hacks, which it plausibly
will.

> Given the FUTEX2_SPIN flag, parse the futex value as the TID of the lock
> owner. Then, before going to the normal wait path, spins while the lock
> owner is running in a different CPU, to avoid the whole context switch
> operation and to quickly return to userspace. If the lock owner is not
> running, just sleep as the normal futex wait path.
>

Syscall costs are so high that by the time you get to this code chances
are decent the owner already released the lock and it is now being held
by someone else, also see below.

> +static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
> + struct hrtimer_sleeper *timeout, u32 uval)
> +{
> + struct task_struct *p;
> + pid_t pid = uval & FUTEX_TID_MASK;
> +
> + p = find_get_task_by_vpid(pid);
> +

In order to even get here all prospective spinners have to serialize on
a spinlock. Then they further dirty a process by grabbing a ref on it.
This seriously hinders the mechanism but is avoidable.

> + /* no task found, maybe it already exited */
> + if (!p) {
> + futex_q_unlock(hb);
> + return -EAGAIN;
> + }
> +
> + /* can't spin in a kernel task */
> + if (unlikely(p->flags & PF_KTHREAD)) {
> + put_task_struct(p);
> + futex_q_unlock(hb);
> + return -EPERM;
> + }
> +
> + futex_queue(q, hb);
> +
> + if (timeout)
> + hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
> +
> + while (1) {
> + if (likely(!plist_node_empty(&q->list))) {
> + if (timeout && !timeout->task)
> + goto exit;
> +
> + if (task_on_cpu(p)) {
> + /* spin */

cpu_relax() is the thing to do here, it was already mentioned by someone
else.

My comment is that as mentioned above the lock owner by now may be
different. So to my reading this spins on the lock as long as p is
running, which may not even be holding it at the moment. Worse, by now
the p task may also be spinning in this very place.

That is to say I don't believe passing TID as an argument to the
syscall is a viable approach.

Instead this code will have to read the futex value on its own every
time (with a special accessor which does not allow for page faults) and
do the find_get_task_by_vpid + task_on_cpu combo under RCU (no refing of
the proc).

The queue locking would also be dodged, except for the case where the
code decides to go off cpu.

Short of something like that imho this has no hopes of competing with
userspace spinning.