Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
From: Jim Newsome
Date: Wed Mar 17 2021 - 13:58:58 EST
I'm not well versed in this part of the kernel (ok, any part, really),
but I wanted to chime in from a user perspective that I'm very
interested in this functionality.
We (Rob + Ryan + I, cc'd) are currently developing the second generation
of the Shadow simulator <https://shadow.github.io/>, which is used by
various researchers and the Tor Project. In this new architecture,
simulated network-application processes (such as tor, browsers, and web
servers) are each run as a native OS process, started by forking and
exec'ing its unmodified binary. We are interested in supporting large
simulations (e.g. 50k+ processes), and expect them to take on the order
of hours or even days to execute, so scalability and performance matter.
We've prototyped two mechanisms for controlling these simulated
processes, and a third hybrid mechanism that combines the two. I've
mentioned one of these (ptrace) in another thread ("do_wait: make
PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
an LD_PRELOAD'd shim that implements the libc interface, and
communicates with Shadow via a syscall-like API over IPC.
So far the fastest version of this IPC we've tried uses a small region
of shared memory and a pair of semaphores. It looks much like the
example in Peter's proposal:
> a. T1: futex-wake T2, futex-wait
> b. T2: wakes, does what it has been woken to do
> c. T2: futex-wake T1, futex-wait
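To make that pattern concrete, here is a minimal sketch of one such round trip in C. It is not Shadow's actual code: the `struct channel` layout and the names `futex_op`, `sem_up`, `sem_down`, and `run_ping_pong` are hypothetical, and the "semaphores" are assumed to be simple futex-backed binary flags in an anonymous shared mapping. Each `sem_up` followed by `sem_down` corresponds to the futex-wake plus futex-wait pair in steps a and c.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical channel layout, for illustration only. */
struct channel {
    _Atomic uint32_t to_child;   /* 1 = controller handed control over */
    _Atomic uint32_t to_parent;  /* 1 = controlled process handed it back */
    long request;                /* stand-in for a syscall-like request */
    long reply;                  /* stand-in for the response */
};

static long futex_op(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Binary-semaphore "post": set the flag, then wake one waiter. */
static void sem_up(_Atomic uint32_t *word)
{
    atomic_store(word, 1);
    futex_op(word, FUTEX_WAKE, 1);
}

/* Binary-semaphore "wait": consume the flag, sleeping while it is 0. */
static void sem_down(_Atomic uint32_t *word)
{
    while (atomic_exchange(word, 0) == 0)
        futex_op(word, FUTEX_WAIT, 0);  /* returns at once if *word != 0 */
}

/* Runs `rounds` request/reply round trips between a parent ("T1") and a
 * forked child ("T2"). Returns the number of correct replies, or -1 if
 * setup failed. */
long run_ping_pong(long rounds)
{
    struct channel *ch = mmap(NULL, sizeof(*ch), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ch == MAP_FAILED)
        return -1;
    atomic_init(&ch->to_child, 0);
    atomic_init(&ch->to_parent, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(ch, sizeof(*ch));
        return -1;
    }
    if (pid == 0) {
        for (long i = 0; i < rounds; i++) {
            sem_down(&ch->to_child);      /* sleep until woken by T1 */
            ch->reply = ch->request + 1;  /* step b: do the work */
            sem_up(&ch->to_parent);       /* step c: wake T1, then wait */
        }
        _exit(0);
    }

    long ok = 0;
    for (long i = 0; i < rounds; i++) {
        ch->request = i;
        sem_up(&ch->to_child);            /* step a: wake T2 ... */
        sem_down(&ch->to_parent);         /* ... then wait for its reply */
        if (ch->reply == i + 1)
            ok++;
    }
    waitpid(pid, NULL, 0);
    munmap(ch, sizeof(*ch));
    return ok;
}
```

In this sketch every hand-off is a separate wake followed by a separate wait, so a full round trip can cost up to four futex calls. That wake-then-wait pair is the part the quoted proposal would fold into a single operation.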
We've been able to get the switching costs down using CPU pinning and
SCHED_FIFO. Each physical CPU spends most of its time swapping back and
forth between a Shadow worker thread and an emulated process. Even so,
the new architecture is so far slower than the first generation of
Shadow, which multiplexes the simulated processes into its own handful
of OS processes (but is complex and fragile).
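For reference, the pinning and SCHED_FIFO setup mentioned above can be sketched with the standard Linux calls `sched_setaffinity` and `sched_setscheduler`. The helper names below (`first_allowed_cpu`, `pin_to_cpu`, `use_sched_fifo`) are hypothetical, and this is only an assumption about how such a setup might look, not Shadow's actual configuration code.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Returns the lowest-numbered CPU the calling thread may run on,
 * or -1 if the affinity mask could not be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restricts the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Switches the calling thread to SCHED_FIFO at the given priority.
 * This normally needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, so it
 * is expected to fail with EPERM for an unprivileged process.
 * Returns 0 on success, -1 on error (errno is set). */
int use_sched_fifo(int priority)
{
    struct sched_param param = { .sched_priority = priority };
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

In a setup like the one described, both the worker thread and the process it controls would be pinned to the same CPU, so that a hand-off between them never has to cross CPUs.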
> With FUTEX_SWAP, steps a and c above can be reduced to one futex
> operation that runs 5-10 times faster.
IIUC the proposed primitives could let us further improve performance,
and perhaps drop some of the complexity of attempting to control the
scheduler via pinning and SCHED_FIFO.