Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls

From: Peter Oskolkov
Date: Mon Jul 12 2021 - 19:31:30 EST


On Mon, Jul 12, 2021 at 2:44 PM Thierry Delisle <tdelisle@xxxxxxxxxxxx> wrote:
>
> > sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
> > wakes it. These are standard sched operations. If they are emulated
> > via futexes, fast context switching will require something like
> > FUTEX_SWAP that was NACKed last year.
>
> I understand these wait and wake semantics and the need for the fast
> context-switch(swap). As I see it, you need 3 operations:
>
> - SWAP: context-switch directly to a different thread, no scheduler involved
> - WAIT: block current thread, go back to server thread
> - WAKE: unblock target thread, add it to scheduler, e.g. through
> idle_workers_ptr
>
> There is no existing syscall to handle SWAP, so I agree sys_umcg_wait is
> needed for this to work.
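>
> As a sketch (hypothetical wrapper names, not an actual API), the three
> operations would look something like:
>
>     umcg_swap(next_tid);  /* SWAP: switch directly to next_tid   */
>     umcg_wait();          /* WAIT: block, control goes back to   */
>                           /*       the server                    */
>     umcg_wake(tid);       /* WAKE: make tid runnable, e.g. via   */
>                           /*       idle_workers_ptr              */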
>
> However, there already exists sys_futex to handle WAIT and WAKE. When a
> worker calls either sys_futex FUTEX_WAIT or sys_umcg_wait with next_tid ==
> NULL, in both cases the worker will block, SWAP to the server, and wait
> for FUTEX_WAKE or UMCG_WAIT_WAKE_ONLY respectively. It's not obvious to me
> that there would be a performance difference, and the semantics seem to be
> the same.
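>
> To illustrate what I mean by equivalent (plain futex(2) on one side;
> the umcg call is my pseudo-code, since the exact signature is not
> final):
>
>     /* Worker blocks on a futex word; resumes after FUTEX_WAKE: */
>     syscall(SYS_futex, futex_word, FUTEX_WAIT, expected, NULL, NULL, 0);
>
>     /* Worker blocks via UMCG with next_tid == NULL; resumes after a
>        UMCG_WAIT_WAKE_ONLY call targeting it: */
>     sys_umcg_wait(0 /* flags */, 0 /* abs_timeout */);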
>
> So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?

Because the approach you described was tried last year and NACKed:
https://lore.kernel.org/lkml/20200722234538.166697-1-posk@xxxxxxx/

In short, the futex maintainers do not want to touch the existing futex
code at all, other than for bugfixes. No new futex functionality,
period. See e.g. the futex2 efforts:
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@xxxxxxxxxxxxx/

> Is the idea to support workers directly context-switching among each other,
> without involving server threads and without going through idle_servers_ptr?

Yes.

> If so, can you explain some of the intended state transitions in this case.

Cooperative scheduling: workers can yield to each other (i.e. swap).
This allows building very fast concurrency primitives (mutexes,
producer-consumer queues, etc.). For example, in a producer-consumer
queue the producer can prepare an item and then context-switch to the
consumer synchronously, on-CPU; this can be done much faster than
waking the consumer asynchronously, and it helps a lot with latency
(throughput, not so much).
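
A rough sketch of the producer side (umcg_swap() here is a hypothetical
wrapper over sys_umcg_wait with next_tid set; queue_push() and the
types are illustrative, not part of the patchset):

    /* Publish an item, then hand the CPU directly to the consumer,
     * bypassing the scheduler's wakeup path entirely. */
    static void produce(struct queue *q, struct item *it, pid_t consumer)
    {
        queue_push(q, it);      /* make the item visible       */
        umcg_swap(consumer);    /* synchronous context switch   */
                                /* to the consumer on this CPU  */
    }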

>
>
> > However, I do not understand how the userspace is expected to use it. I
> > also do not understand if these link fields form a stack or a queue, and
> > where the head is.
> >
> > When a server has nothing to do (no work to run), it is put into IDLE
> > state and added to the list. The kernel wakes an IDLE server if a
> > blocked worker unblocks.
>
> From the code in umcg_wq_worker_running (Step 3), I am guessing users are
> expected to provide a global head somewhere in memory and
> umcg_task.idle_servers_ptr points to the head of the list for all workers.
> Servers are then added in user space using atomic_stack_push_user. Is this
> correct? I did not find any documentation on the list head.
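>
> (For concreteness, my guess at the push, written as a Treiber-style
> stack; the helper name and the link field are hypothetical, since the
> head layout is what I could not find documented:)
>
>     #include <stdatomic.h>
>     #include <stdint.h>
>
>     struct node { uint64_t next; }; /* stand-in for the link field */
>
>     static void idle_servers_push(_Atomic uint64_t *head,
>                                   struct node *self)
>     {
>         uint64_t old = atomic_load_explicit(head, memory_order_relaxed);
>         do {
>             self->next = old; /* link to the previous head */
>         } while (!atomic_compare_exchange_weak_explicit(
>                      head, &old, (uint64_t)(uintptr_t)self,
>                      memory_order_release, memory_order_relaxed));
>     }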

This is going to change somewhat. I'll post a new patchset soon-ish
(later this week?).

>
> I like the idea that each worker thread points to a given list; it allows
> for separate containers with their own independent servers, workers, and
> scheduling. However, it seems that the list itself could be implemented
> using existing kernel APIs, for example a futex or an eventfd. Like so:
>
> struct umcg_task {
> [...]
>
> /**
> * @idle_futex_ptr: pointer to a futex used for idle server threads.
> *
> * When waking a worker, the kernel decrements the pointed-to futex
> * value if it is non-zero, and wakes a server if the decrement
> * occurred.
> *
> * Server threads that have no work to do should increment the futex
> * value and FUTEX_WAIT.
> */
> uint64_t idle_futex_ptr; /* r/w */
>
> [...]
> } __attribute__((packed, aligned(8 * sizeof(__u64))));
>
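> A minimal sketch of the server side under this proposal (assuming the
> pointed-to value is used as a plain 32-bit futex word; the helper name
> is mine):
>
>     #include <linux/futex.h>
>     #include <stdatomic.h>
>     #include <sys/syscall.h>
>     #include <unistd.h>
>
>     static void server_idle_wait(_Atomic unsigned int *idle_futex)
>     {
>         /* Announce this server as idle... */
>         unsigned int val = atomic_fetch_add(idle_futex, 1) + 1;
>         /* ...and sleep until the kernel decrements the count and
>            wakes us; a concurrent decrement just makes WAIT return
>            immediately with EAGAIN, which is fine. */
>         syscall(SYS_futex, idle_futex, FUTEX_WAIT, val, NULL, NULL, 0);
>     }
>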
> I believe the futex approach, like the list, has the advantage that when
> there are no idle servers, checking the list requires no locking. I don't
> know if that can be achieved with eventfd.

As mentioned above, a futex-based solution was rejected by the
maintainers. Believe me, I tried. Not only tried: we have a working
userspace scheduling stack on top of FUTEX_SWAP deployed internally at
Google, with some actual usage (mostly testing/debugging workloads). I
suggest we stop discussing futexes in this context, as I do not see the
maintainers changing their position. And the approach used in this
patchset seems to work (I actually like how the code is shaping up).