Re: UMCG - how should we proceed? Should we?

From: Peter Oskolkov
Date: Thu Apr 06 2023 - 13:19:39 EST


On Thu, Apr 6, 2023 at 3:38 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Mar 28, 2023 at 02:07:54PM -0700, Peter Oskolkov wrote:
> > Hi Peter!
> >
> > TL;DR: which approach, if any, should a UMCG implementation in the mainline kernel use?
> >
> > Details:
> >
> > We are rolling out internally a UMCG implementation copied below (with some
> > boilerplate omitted), so I would like to restart our discussion on the topic.
> >
> > The implementation below is different from what we had earlier
> > (https://lore.kernel.org/lkml/20220120155517.066795336@xxxxxxxxxxxxx/)
> > in that it keeps UMCG state in the kernel rather than TLS.
> >
> > While having UMCG state in TLS is _much_ better, as it makes state synchronization
> > between the userspace and the kernel much simpler, the whole page pinning
> > machinery in the link above looked very scary, honestly.
> >
> > So if we are going to ever have something like UMCG in the mainline kernel, we need
> > to figure out the approach to use: the TLS-based one, something similar
> > to what we have now internally (details below), or something else. Or none at all...
> >
> > While I would very much prefer to have it done your way (state in TLS), the page pinning
> > business was too much for me. If you can figure out a way to do it cleanly and reliably, great!
>
> A few quick notes without having looked at the patch...
>
> > The main differences between what you had in the TLS patchset and what is below:
>
> (note that in the end the per-task UMCG info thing didn't *need* to be
> TLS, although it is a logical place to put it)

Yes, of course. By "TLS" here I mean a per-task area in userspace. It's
just easier to type "TLS" than "a per-task userspace area similar to
rseq"...

>
> > - per worker/server state not in TLS but in task_struct
> > - we keep a list of idle workers and a list of idle servers in mm
>
> How much of a scalability fail is that? Mathieu and me are currently
> poking at a rseq/cid regression due to large multi thread contention on
> mm data.

Our main use case is a small number of servers with a single
cross-server queue/scheduler in userspace, not per-server
queues/schedulers, so executing a couple of instructions (adding a
task to a singly-linked list) under a spinlock does not seem to be an
issue. If it ever becomes one, we can always switch to lock-free lists.

>
> But yeah, I think this was one of the open issues we still had; with the
> other implementation -- I seem to have a half finished patch for an
> idle_server list.
>
> > - worker wake events are delivered not to servers which ran the workers earlier,
> > but to idle servers from the idle server list
>
> Provided there is one I take it; very easy to run out of idle things.
> Also, what if you want to explicitly manage placement, can you still
> direct the wakeup?

As I mentioned above, we don't have per-server queues/schedulers, so
we didn't need to direct wakeups to specific servers. Again, in our
model, if we have M servers and N workers and all M servers are busy
running workers (i.e. no idle servers), then waking a server when a
blocked worker wakes up means either preempting a running worker or
grabbing an additional CPU; neither option fits our model well.

Doing this in a more flexible and scalable way that accommodates
per-server queues/scheduling and RT scheduling would be great, of
course, but I suspect the implementation would be more complex; and we
definitely want to stick to the principle that "userspace cannot have
more running tasks/threads than there are servers" (background stuff
excluded, of course; but scheduling code is very much "foreground").
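To illustrate the wake-up policy described above (again just a sketch;
umcg_task_state and its fields are made up for this example, only
wake_up_process() is real):

  /* Hypothetical per-task state; only the fields used below. */
  struct umcg_task_state {
          struct task_struct      *task;
          struct list_head        idle_node;
  };

  /* Wake-up path for a worker that just unblocked. */
  static void umcg_worker_woke(struct umcg_mm_state *state,
                               struct umcg_task_state *worker)
  {
          struct umcg_task_state *server = NULL;

          spin_lock(&state->lock);
          if (!list_empty(&state->idle_servers)) {
                  server = list_first_entry(&state->idle_servers,
                                            struct umcg_task_state, idle_node);
                  list_del(&server->idle_node);
          } else {
                  /* No idle server: park the worker; a server picks it up
                   * later, so we never run more workers than servers. */
                  list_add_tail(&worker->idle_node, &state->idle_workers);
          }
          spin_unlock(&state->lock);

          if (server)
                  wake_up_process(server->task);
  }

This is what keeps the "no more running workers than servers"
invariant: the wake event is either handed to an idle server or
deferred.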

>
> > - worker preemption happens not via a syscall (umcg_kick) but by hooking
> > into sched_tick
>
> sched_tick would render it somewhat unsuitable for RT
> workloads/schedulers where you might need more immediate preemption.

Yes; on the other hand, having preemption only via a syscall
(umcg_kick) means userspace has to track all running workers, juggle
timers, etc., and we are back to the unresolved question of needing
extra CPUs to do all this (who will kick workers from userspace if the
workers occupy all allocated CPUs?). Using sched_tick is much simpler
and does not require extra concurrency/CPUs. Maybe we can have it both
ways: an explicit umcg_kick() for RT use cases and "regular"
tick-based preemption?

>
> > None of the differences above are deal breakers; again, if the TLS/page pinning
> > approach is viable, we will gladly use it.
>
> Urgh, so yeah.. I meant to go look at the whole UMCG thing again with an
> eye specifically at inter-process support.
>
> I'm hoping inter-process UMCG can be used to implement custom libpthread
> that would allow running most of userspace under a custom UMCG scheduler
> and obviate the need for this horrible piece of shit eBPF sched thing.
>
> But I keep getting side-tracked with other stuff :/ I'll try and bump
> this stuff up the todo list.

In-process and cross-process userspace scheduling are two distinct use
cases for us, and doing them separately means they can be worked on by
different people in parallel (i.e. done faster). Can this all be done
as one thing that addresses both use cases? Probably yes; the question
is how long it will take.