[PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation

From: Peter Oskolkov
Date: Mon Aug 03 2020 - 18:15:29 EST

From: Peter Oskolkov <posk@xxxxxxxxxx>

As Paul Turner presented at LPC in 2003, Google has developed
an M:N userspace threading subsystem backed by Google-private
SwitchTo Linux Kernel API. This subsystem provides latency-sensitive
services at Google withfine-grained user-space control/scheduling
over what is running when, and this subsystem is used widely
internally (called schedulers or fibers).

A simplified/idealized use case: imagine a multi-user service application
(e.g. a DBMS) that has to implement the following user CPU quota
- each user (these are DBMS users, not Linux users) can purchase
certain amounts of expensive, but guaranteed, low-latency
CPU quota (as a % of total CPUs available to the service), and a certain
amount of cheap high-latency CPU quota;
- quotas are enforced per second;
- each user RPC/request to the service can specify whether this is
a latency-critical request that should use the user's low-latency quota,
and be immediately rejected if the quota for this second is exhausted;
- requests can be high-latency-tolerant: should only use the high-latency
- a request can also be latency-tolerant: it should use the low-latency
quota if available, or the high-latency quota if the low-latency quota
is exhausted;
- all "sold" (= allocated) low-latency quotas can go up to, but not exceed,
70% of all available CPUs (i.e. no over-subscription);
- high-latency quotas are oversubscribed;
- user isolation: misbehaving users should not affect the serving
latency of users with available low-latency quotas;
- soft deadlines/timeouts: each request/RPC can specify that it must
be served within a certain deadline (let's say the minimum deadline
is 10 milliseconds) or dropped if the deadline is exceeded;
- each user request can potentially spawn several processing threads/tasks,
and do child RPCs to remote services; these threads/tasks should
also be covered by this quota/policy;
- user requests should be served somewhat in-order: requests that use
the same quota tiers that arrive earlier should be granted CPU before
requests that arrive later ("arrival order scheduling").

There are many services at Google that implement a variant of the scheduling
policy outlined above. In reality there are more priorities/quota tiers,
there is service-internal maintenance work that can be either high
or low priority, sometimes FIFO/LIFO/round robin scheduling is used in
addition to arrival order scheduling, etc. (for example, LIFO scheduling
is better at cache-locality in certain scenarios). User isolation within
a process, as well as latency improvements are the main benefits (on top
of the actual ability to implement complex quota/scheduling policies).

What is important is that these scheduling policies are driven by
userspace schedulers built on top of these basic kernel primitives:
- block: block the current thread/task (with a timeout);
- resume: resume some previously blocked task (should commutate
with block, i.e. racing block/resume pairs should behave
exactly as if wake arrived after block);
- switch_to: block the current thread, resume some previously
blocked task (behaves exactly as wake(remote), block(current), but
optimized to do a fast context switch on the fast path);
- block detection: when a task blocks in the kernel (on a network
read, for example), the userspace scheduler is notified and
schedules (resumes or swaps into) a pending task in the newly available
CPU slot;
- wake detection: when a task wakes from a previously blocking kernel
operation (e.g. can now process some data on a network socket), the
userspace scheduler is notified and can now schedule the task to
run on a CPU when a CPU is available and the task can use it according
to its scheduling policy.

(Technically, block/wake detection is still experimental and not
used widely: as we control the userspace, we can actually determine
blocking/waking syscalls without kernel support).

Internally we currently use kernel patches that are too "intrusive" to be
included in a general-purpose Linux kernel, so we are exploring ways to
upstream this functionality.

The easiest/least intrusive approach that we have come up with is this:

- block/resume map perfectly to futex wait/wake;
- switch_to thus maps to FUTEX_SWAP;
- block and wake detection can be done either through tracing
or by introducing new BPF attach points (when a task blocks or wakes,
a BPF program is triggered that then communicates with the userspace);
- the BPF attach points are per task, and the task needs to "opt in"
(i.e. all other tasks suffer just an additional pointer comparison
on block/wake);
- the BPF programs triggered on block/wake should be able to perform
futex ops (e.g. wake a designated userspace scheduling task) - this
probably indicates that tracing is not enough, and a new BPF prog type
is needed.

In addition to the above, another common use case for FUTEX_SWAP is
message passing a-la RPC between tasks: task/thread T1 prepares a message,
wakes T2 to work on it, and waits for the results; when T2 is done, it
wakes T1 and waits for more work to arrive. Currently the simplest
way to implement this is

a. T1: futex-wake T2, futex-wait
b. T2: wakes, does what it has been woken to do
c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
that runs 5-10 times faster.

Patches in this patchset:

Patch 1: (this patch) add FUTEX_SWAP #defines, as well as the overall
reasoning behind the patchset.
Patch 2: implement FUTEX_SWAP futex operation that, internally, does
wake + wait.
Patch 3: the main speed-up of FUTEX_SWAP: migrate the wakee to the
waker's CPU.
Patch 4: a selftest that can also be used to benchmark FUTEX_SWAP vs

Tested: see patch 4 in this patchset.

Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx>
include/uapi/linux/futex.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0accd5e..c1d151d97dea 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -21,6 +21,7 @@
+#define FUTEX_SWAP 13

@@ -40,6 +41,7 @@

* Support for robust futexes: the kernel cleans up held futexes at