Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers

From: Peter Zijlstra
Date: Tue Jul 13 2021 - 12:10:44 EST


On Fri, Jul 09, 2021 at 09:57:32AM -0700, Peter Oskolkov wrote:
> On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> > This is horrible... Jann is absolutely right, you do not, *ever* do
> > userspace spinlocks. What's wrong with the trivial lockless single
> > linked list approach?.
>
> I'm not sure how to get around the ABA problem with a lockless single
> linked list: https://en.wikipedia.org/wiki/ABA_problem

I'm familiar with the problem. I'm just not sure how we got there.

Last time we had umcg_blocked_ptr / umcg_runnable_ptr which were kernel
append, user clear single linked lists used for RUNNING->BLOCKED and
BLOCKED->RUNNABLE notifications.

But those seem gone, instead we now have idle_servers_ptr /
idle_workers_ptr. I've not yet fully digested things, but they seem to
implement some 'SMP' policy. Can we please forget about the whole SMP
thing for now and focus on getting the basics sorted?

So if we implement the bits outlined here:

https://lore.kernel.org/linux-api/20210609125435.GA68187@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
https://lore.kernel.org/linux-api/YMJTyVVdylyHtkeW@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Then what is mising for full N:1 (aka UP) ?

One thing I've considered is that we might want to add u64 timestamps
for the various state transitions, such that userspace can compute
elapsed time which is useful for more dynamic scheduling policies.

Another is preemption; I'm thinking we can drive that with signals, but
that does give complications -- signals are super gross API wise.
Another method would be much preferred I think. We could add another
syscall which allows userspace (from eg. SIGALRM/SIGPROF/SIGVTALRM) to
force a worker to do a RUNNING->RUNNABLE transition and schedule back to
the server.


Then lets consider N:M (aka SMP). The basics of SMP is sharing work
between servers. For a large part userspace can already do that by
keeping a shared ready queue. Servers that go idle pick up a work,
re-assign it to themselves and go.

The pain-point seems to be *efficient* BLOCKED->RUNNABLE notifications
across servers. Inefficient options include the userspace servers
iterating all known other servers and trying to steal their
umcg_runnable_ptr and processing it. This is 'difficult' in that there
is no natural wakeup and hence letting a server do IDLE will increase
latency and destroy work concervance.

The alternative seems to be to let the kernel do the server iteration,
looking for a RUNNING one and using that umcg_runnable_ptr field and
kicking it. For that we can set up an immutable linked list between
struct umcg_task's, a circular single linked list that the kernel
iterates until it's back where it started. Then there is no dymaic
state.

Hmm?