Re: UMCG - how should we proceed? Should we?

From: Peter Zijlstra
Date: Thu Apr 06 2023 - 06:38:53 EST


On Tue, Mar 28, 2023 at 02:07:54PM -0700, Peter Oskolkov wrote:
> Hi Peter!
>
> TL;DR: which approach, if any, should a UMCG implementation in the mainline kernel use?
>
> Details:
>
> We are internally rolling out a UMCG implementation, copied below (with some
> boilerplate omitted), so I would like to restart our discussion on the topic.
>
> The implementation below is different from what we had earlier
> (https://lore.kernel.org/lkml/20220120155517.066795336@xxxxxxxxxxxxx/)
> in that it keeps UMCG state in the kernel rather than TLS.
>
> While having UMCG state in TLS is _much_ better, as it makes state synchronization
> between userspace and the kernel much simpler, the whole page pinning
> machinery in the link above looked very scary, honestly.
>
> So if we are going to ever have something like UMCG in the mainline kernel, we need
> to figure out the approach to use: the TLS-based one, something similar
> to what we have now internally (details below), or something else. Or none at all...
>
> While I would very much prefer to have it done your way (state in TLS), the page pinning
> business was too much for me. If you can figure out a way to do it cleanly and reliably, great!

A few quick notes without having looked at the patch...

> The main differences between what you had in the TLS patchset and what is below:

(note that in the end the per-task UMCG info thing didn't *need* to be
TLS, although it is a logical place to put it)
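
Something like this is all the kernel really needs per task -- a small
blob at a stable address it can read and write. Names here are entirely
made up, just to illustrate the shape, not the layout from the old
series:

  struct umcg_task_info {
          uint64_t state;         /* RUNNING / RUNNABLE / BLOCKED, etc. */
          uint32_t next_tid;      /* who to switch to */
          uint32_t server_tid;    /* who manages this worker */
  };

Whether that lives in TLS, some other userspace mapping, or only inside
task_struct mostly changes who can conveniently poke at it and what the
kernel has to do to access it (pin pages vs. plain kernel memory).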

> - per worker/server state not in TLS but in task_struct
> - we keep a list of idle workers and a list of idle servers in mm

How much of a scalability fail is that? Mathieu and I are currently
poking at an rseq/cid regression due to heavy multi-threaded contention
on per-mm data.

But yeah, I think this was one of the open issues we still had with the
other implementation -- I seem to have a half-finished patch for an
idle_server list.
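
To make that worry concrete, I'm picturing something of this shape --
all names invented, just a sketch of what "idle lists in mm" implies:

  struct umcg_mm_state {
          spinlock_t       lock;          /* shared by every thread in the process */
          struct list_head idle_workers;
          struct list_head idle_servers;
  };

  /* every block/unblock/wake path ends up doing roughly this */
  static void umcg_enqueue_idle_worker(struct umcg_mm_state *ms,
                                       struct list_head *node)
  {
          spin_lock(&ms->lock);
          list_add_tail(node, &ms->idle_workers);
          spin_unlock(&ms->lock);
  }

With a few hundred workers blocking and waking all the time, that one
per-mm cacheline gets very hot, very much like the cid/rseq thing.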

> - worker wake events are delivered not to servers which ran the workers earlier,
> but to idle servers from the idle server list

Provided there is one, I take it; it is very easy to run out of idle
servers. Also, what if you want to explicitly manage placement -- can
you still direct the wakeup to a specific server?
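
I imagine the pick-an-idle-server side looks roughly like this (again,
invented names, just a sketch):

  struct umcg_server {
          struct task_struct *task;
          struct list_head    idle_node;
  };

  static struct task_struct *umcg_pick_idle_server(struct umcg_mm_state *ms)
  {
          struct umcg_server *srv;

          spin_lock(&ms->lock);
          srv = list_first_entry_or_null(&ms->idle_servers,
                                         struct umcg_server, idle_node);
          if (srv)
                  list_del_init(&srv->idle_node);
          spin_unlock(&ms->lock);

          return srv ? srv->task : NULL;
  }

Which leaves both questions open: what happens when that returns NULL,
and where does userspace get to say "wake *that* server".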

> - worker preemption happens not via a syscall (umcg_kick) but by hooking
> into sched_tick

Hooking into sched_tick would render it somewhat unsuitable for RT
workloads/schedulers, where you might need more immediate preemption
than tick granularity allows.
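
That is, at best you get something like the below, once per tick (the
fields are hypothetical, just to show the granularity):

  /* called from the scheduler tick for the current task */
  static void umcg_tick(struct task_struct *p)
  {
          struct umcg_server *srv = p->umcg_server;       /* hypothetical */

          if (!srv)
                  return;

          if (time_after(jiffies, p->umcg_slice_end))     /* hypothetical */
                  set_tsk_need_resched(p);
  }

So the preemption granularity is 1/HZ at best, while something like
umcg_kick (or anything IPI based) can yank a worker more or less
immediately, which is what an RT-ish scheduler would want.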

> None of the differences above are deal breakers; again, if the TLS/page pinning
> approach is viable, we will gladly use it.

Urgh, so yeah... I meant to go look at the whole UMCG thing again,
specifically with an eye towards inter-process support.

I'm hoping inter-process UMCG can be used to implement a custom
libpthread that would allow running most of userspace under a custom
UMCG scheduler and obviate the need for this horrible piece of shit
eBPF sched thing.

But I keep getting side-tracked with other stuff :/ I'll try to bump
this up the todo list.