Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls

From: Thierry Delisle
Date: Mon Jul 12 2021 - 17:44:25 EST


> sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
> wakes it. These are standard sched operations. If they are emulated
> via futexes, fast context switching will require something like
> FUTEX_SWAP that was NACKed last year.

I understand these wait and wake semantics and the need for a fast
context switch (swap). As I see it, you need 3 operations:

- SWAP: context-switch directly to a different thread, no scheduler involved
- WAIT: block current thread, go back to server thread
- WAKE: unblock target thread, add it to scheduler, e.g. through
        idle_workers_ptr

There are no existing syscalls to handle SWAP, so I agree sys_umcg_wait is
needed for this to work.

However, sys_futex already exists to handle WAIT and WAKE. When a worker
calls either sys_futex WAIT or sys_umcg_wait with next_tid == NULL, in both
cases the worker will block, SWAP to the server, and wait for FUTEX_WAKE or
UMCG_WAIT_WAKE_ONLY, respectively. It's not obvious to me that there would be
a performance difference, and the semantics seem to be the same to me.
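
For reference, the futex side of that comparison would look roughly like the
sketch below. The names futex_wait, futex_wake and futex_demo are mine (glibc
does not provide a futex wrapper); only the raw SYS_futex calls are real, and
I am not claiming this matches what the patch does internally.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: block while *uaddr == expected.
 * Returns 0 when woken, -1 with errno set otherwise
 * (EAGAIN if *uaddr already differs from expected). */
static long futex_wait(uint32_t *uaddr, uint32_t expected)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Hypothetical helper: wake up to nr waiters; returns the number woken. */
static long futex_wake(uint32_t *uaddr, int nr)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr, NULL, NULL, 0);
}

/* Single-threaded sanity check: neither call blocks here. */
static void futex_demo(void)
{
	uint32_t word = 1;

	/* The value does not match, so the kernel refuses to sleep. */
	assert(futex_wait(&word, 0) == -1 && errno == EAGAIN);
	/* Nobody is waiting on the word, so nobody is woken. */
	assert(futex_wake(&word, 1) == 0);
}
```

In the worker case, the worker would call futex_wait on a word it owns and
whoever unblocks it would call futex_wake on that word.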

So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?

Is the idea to support workers directly context-switching among each other,
without involving server threads and without going through idle_servers_ptr?

If so, can you explain some of the intended state transitions in this case?


> > However, I do not understand how the userspace is expected to use it. I also
> > do not understand if these link fields form a stack or a queue and where is
> > the head.
>
> When a server has nothing to do (no work to run), it is put into IDLE
> state and added to the list. The kernel wakes an IDLE server if a
> blocked worker unblocks.

From the code in umcg_wq_worker_running (Step 3), I am guessing users are
expected to provide a global head somewhere in memory and
umcg_task.idle_servers_ptr points to the head of the list for all workers.
Servers are then added in user space using atomic_stack_push_user. Is this
correct? I did not find any documentation on the list head.

I like the idea that each worker thread points to a given list, it allows the
possibility for separate containers with their own independent servers, workers
and scheduling. However, it seems that the list itself could be implemented
using existing kernel APIs, for example a futex or an event fd. Like so:

struct umcg_task {
     [...]

     /**
      * @idle_futex_ptr: pointer to a futex used for idle server threads.
      *
      * When waking a worker, the kernel decrements the pointed-to futex value
      * if it is non-zero and wakes a server if the decrement occurred.
      *
      * Server threads that have no work to do should increment the futex
      * value and FUTEX_WAIT.
      */
     uint64_t    idle_futex_ptr;    /* r/w */

     [...]
} __attribute__((packed, aligned(8 * sizeof(__u64))));

I believe the futex approach, like the list, has the advantage that when there
are no idle servers, checking the list requires no locking. I don't know if
that can be achieved with eventfd.
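
To make the counting part of the idea concrete, here is a rough user-space
sketch using C11 atomics. The function names are hypothetical, the waker side
would live in the kernel in my proposal, and the actual FUTEX_WAIT/FUTEX_WAKE
calls (and the race between the increment and the wait) are deliberately left
out. It only shows the "decrement if non-zero" step and that the empty case
costs a single load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Server side: announce one more idle server. The server would then
 * FUTEX_WAIT on the same word (not shown). */
static void idle_server_announce(_Atomic uint32_t *idle)
{
	atomic_fetch_add_explicit(idle, 1, memory_order_release);
}

/* Waker side: decrement only if non-zero. Returns true if an idle
 * server was claimed and should receive a FUTEX_WAKE (not shown). */
static bool idle_server_claim(_Atomic uint32_t *idle)
{
	uint32_t cur = atomic_load_explicit(idle, memory_order_relaxed);

	while (cur != 0) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(idle, &cur, cur - 1,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
	return false;	/* no idle servers: one load, no lock taken */
}

/* Small walk-through of the counter protocol. */
static void idle_counter_demo(void)
{
	_Atomic uint32_t idle = 0;

	assert(!idle_server_claim(&idle));	/* nothing to wake */
	idle_server_announce(&idle);		/* one server goes idle */
	assert(idle_server_claim(&idle));	/* waker claims it */
	assert(atomic_load(&idle) == 0);
}
```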