Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation

From: Peter Oskolkov
Date: Wed Mar 17 2021 - 14:44:28 EST


Hi Jim, thank you for your interest!

While FUTEX_SWAP seems to be a nonstarter, there is a discussion
off-list on how to approach the larger problem of userspace
scheduling. A full userspace scheduling patchset is likely to need
some time to take shape, but the "core" patches of wait/wake/swap are
more or less ready, so I'll probably post an early RFC version here in
the next week or two.

CC-ing the maintainers.

Thanks,
Peter

On Wed, Mar 17, 2021 at 10:59 AM Jim Newsome <jnewsome@xxxxxxxxxxxxxx> wrote:
>
> I'm not well versed in this part of the kernel (ok, any part, really),
> but I wanted to chime in from a user perspective that I'm very
> interested in this functionality.
>
> We (Rob + Ryan + I, cc'd) are currently developing the second generation
> of the Shadow simulator <https://shadow.github.io/>, which is used by
> various researchers and the Tor Project. In this new architecture,
> simulated network-application processes (such as tor, browsers, and web
> servers) are each run as a native OS process, started by forking and
> exec'ing its unmodified binary. We are interested in supporting large
> simulations (e.g. 50k+ processes), and expect them to take on the order
> of hours or even days to execute, so scalability and performance matter.
>
> We've prototyped two mechanisms for controlling these simulated
> processes, and a third hybrid mechanism that combines the two. I've
> mentioned one of these (ptrace) in another thread ("do_wait: make
> PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
> an LD_PRELOAD'd shim that implements the libc interface, and
> communicates with Shadow via a syscall-like API over IPC.
>
> So far the most performant version of this IPC we've tried uses a bit
> of shared memory and a pair of semaphores. It looks much like the
> example in Peter's proposal:
>
> > a. T1: futex-wake T2, futex-wait
> > b. T2: wakes, does what it has been woken to do
> > c. T2: futex-wake T1, futex-wait
>
> We've been able to get the switching costs down using CPU pinning and
> SCHED_FIFO. Each physical CPU spends most of its time swapping back and
> forth between a Shadow worker thread and an emulated process. Even so,
> the new architecture is so far slower than the first generation of
> Shadow, which multiplexes the simulated processes into its own handful
> of OS processes (but is complex and fragile).
>
> > With FUTEX_SWAP, steps a and c above can be reduced to one futex
> > operation that runs 5-10 times faster.
>
> IIUC the proposed primitives could let us further improve performance,
> and perhaps drop some of the complexity of attempting to control the
> scheduler via pinning and SCHED_FIFO.
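
For readers following along, the wake-then-wait handoff in steps a-c
quoted above can be sketched roughly as below. This is only an
illustration of the pattern, not Shadow's code and not the kernel
interface: it uses Python's threading.Semaphore in place of raw futex
calls, a dict in place of real shared memory, and the names ping_pong,
to_worker and to_caller are made up for the example.

```python
import threading


def ping_pong(rounds):
    """Run `rounds` request/response handoffs between two threads.

    Hypothetical helper: the main thread plays T1, the worker plays T2.
    """
    to_worker = threading.Semaphore(0)  # T1 -> T2 "wake"
    to_caller = threading.Semaphore(0)  # T2 -> T1 "wake"
    shared = {"request": None, "response": None}  # stand-in for shared memory
    results = []

    def worker():  # plays T2
        for _ in range(rounds):
            to_worker.acquire()  # wait until T1 wakes us
            # Step b: do the work we were woken to do (here: double the input).
            shared["response"] = shared["request"] * 2
            to_caller.release()  # step c: wake T1, then loop back and wait

    t2 = threading.Thread(target=worker)
    t2.start()

    for i in range(rounds):
        shared["request"] = i
        to_worker.release()  # step a, first half: wake T2
        to_caller.acquire()  # step a, second half: wait for T2's reply
        results.append(shared["response"])

    t2.join()
    return results


if __name__ == "__main__":
    print(ping_pong(3))
```

Note that every handoff costs two separate operations on each side (a
wake followed by a wait). That pair is what the quoted proposal
suggests collapsing into a single swap operation.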