Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]

From: Florian Weimer
Date: Tue Mar 03 2020 - 08:00:38 EST


* Peter Zijlstra:

> So how about we introduce new syscalls:
>
> sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
>
> struct futex_wait {
> 	void *uaddr;
> 	unsigned long val;
> 	unsigned long flags;
> };
> sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> 		unsigned long flags, ktime_t *timo);
>
> sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
>
> sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> 		unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
>
> Where flags:
>
> - has 2 bits for size: 8,16,32,64
> - has 2 more bits for size (requeue) ??
> - has ... bits for clocks
> - has private/shared
> - has numa

What's the actual type of *uaddr? Does it vary by size (which I assume
is given in bits)? Are there alignment constraints?

These system calls still seem to be type-polymorphic, which is
problematic for defining a really nice C interface. I would really like
to have a strongly typed interface for this, with a nice struct futex
wrapper type (even if it means that we need four of them).
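
As a rough sketch of what four such wrapper types might look like: every
name below (struct futex8 through struct futex64, the FUTEX_SIZE_*
encoding, futex_size_flag_of) is hypothetical and not part of the quoted
proposal, and natural alignment per width is an assumption, not
something the proposal states.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical encoding of the two "size" bits in flags. */
enum futex_size_flag {
    FUTEX_SIZE_U8  = 0,
    FUTEX_SIZE_U16 = 1,
    FUTEX_SIZE_U32 = 2,
    FUTEX_SIZE_U64 = 3,
};

/* One wrapper per width. Natural alignment is assumed here. */
struct futex8  { alignas(1) uint8_t  value; };
struct futex16 { alignas(2) uint16_t value; };
struct futex32 { alignas(4) uint32_t value; };
struct futex64 { alignas(8) uint64_t value; };

/*
 * Derive the size bits from the pointer type. Passing any other
 * pointer type is a compile-time error, so the size in flags cannot
 * disagree with the object being waited on.
 */
#define futex_size_flag_of(f) _Generic((f),  \
    struct futex8  *: FUTEX_SIZE_U8,         \
    struct futex16 *: FUTEX_SIZE_U16,        \
    struct futex32 *: FUTEX_SIZE_U32,        \
    struct futex64 *: FUTEX_SIZE_U64)

/* Width in bytes for a given size flag: 1, 2, 4 or 8. */
static inline unsigned futex_size_bytes(enum futex_size_flag f)
{
    return 1u << f;
}

static_assert(sizeof(struct futex8)  == 1, "no padding expected");
static_assert(sizeof(struct futex64) == 8, "no padding expected");
```

A typed wrapper around the system call could then take, say, a
struct futex32 * and fill in the size bits itself, so callers never
pass them by hand.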

Will all architectures support all sizes? If not, how do we probe which
size/flags combinations are supported?

> For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
> node_id', with the following semantics:
>
> - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
>   directly used to index into per-node hash-tables. When -1, it is
>   replaced by the current node_id and an smp_mb() is issued before we
>   load and compare the @uaddr.
>
> - on WAKE/REQUEUE, it is an immediate index.

Does this mean the first waiter determines the NUMA index, and all
future waiters use the same chain even if they are on different nodes?
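
To make the scenario concrete, here is a single-threaded model of how I
read the quoted WAIT rule. It is only a sketch: resolve_wait_node() and
its parameters are made up for illustration, a real implementation would
need an atomic update plus the smp_mb() mentioned above, and the strict
upper bound is my assumption (the quoted text writes <= nr_nodes).

```c
#include <errno.h>

/*
 * Hypothetical model of the quoted WAIT semantics for the node_id
 * word stored next to the futex.
 *
 *   -1                -> replaced by the caller's current node
 *   0 .. nr_nodes - 1 -> used directly as the hash-table index
 *   anything else     -> -EINVAL
 */
static int resolve_wait_node(int *node_id, int current_node, int nr_nodes)
{
    int n = *node_id;

    if (n == -1) {
        /* First waiter: its node is written back and sticks. */
        *node_id = current_node;
        return current_node;
    }
    if (n < 0 || n >= nr_nodes)
        return -EINVAL;

    /* Later waiters use the stored index, wherever they run. */
    return n;
}
```

Under this reading, a waiter on node 2 that finds -1 stores 2, and a
later waiter running on node 0 still gets index 2.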

I think documenting this as a node index would be a mistake. It could
be an arbitrary hint for locating the corresponding kernel data
structures.

> Any invalid value will result in EINVAL.

Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
need to maintain alignment and avoid padding.
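
The layout problem can be sketched as follows. The struct names are
hypothetical, node_id is assumed to be a 32-bit int, and the 64-bit
futex word is assumed to require 8-byte alignment (forced with alignas
here so the result does not depend on the ABI).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit futex: node_id lands exactly at uaddr - 4, no padding. */
struct numa_futex32 {
    int32_t  node_id;
    uint32_t value;
};

/*
 * Naive 64-bit variant: 4 bytes of padding are inserted before value,
 * so node_id ends up at uaddr - 8 rather than uaddr - 4.
 */
struct numa_futex64_naive {
    int32_t node_id;
    alignas(8) uint64_t value;
};

/*
 * One possible workaround: an explicit leading pad word, so that
 * node_id sits directly in front of the aligned 64-bit value.
 */
struct numa_futex64_padded {
    int32_t pad;
    int32_t node_id;
    alignas(8) uint64_t value;
};

/* Distance in bytes from node_id to the futex word. */
#define NODE_GAP(type) \
    (offsetof(type, value) - offsetof(type, node_id))

static_assert(NODE_GAP(struct numa_futex32) == 4, "uaddr - 4 works");
static_assert(NODE_GAP(struct numa_futex64_naive) == 8, "padding in the way");
static_assert(NODE_GAP(struct numa_futex64_padded) == 4, "uaddr - 4 works");
```

The padded variant keeps uaddr-4 valid, but it costs 16 bytes per futex
and the user has to get the layout right by hand.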

Thanks,
Florian