Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers

From: Peter Oskolkov
Date: Tue Jul 13 2021 - 13:15:10 EST

On Tue, Jul 13, 2021 at 9:10 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 09, 2021 at 09:57:32AM -0700, Peter Oskolkov wrote:
> > On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > This is horrible... Jann is absolutely right, you do not, *ever* do
> > > userspace spinlocks. What's wrong with the trivial lockless single
> > > linked list approach?.
> >
> > I'm not sure how to get around the ABA problem with a lockless single
> > linked list:
> I'm familiar with the problem. I'm just not sure how we got there.
> Last time we had umcg_blocked_ptr / umcg_runnable_ptr which were kernel
> append, user clear single linked lists used for RUNNING->BLOCKED and
> BLOCKED->RUNNABLE notifications.
> But those seem gone, instead we now have idle_servers_ptr /
> idle_workers_ptr. I've not yet fully digested things, but they seem to
> implement some 'SMP' policy. Can we please forget about the whole SMP
> thing for now and focus on getting the basics sorted?

Hmm... yes, I was always working on the M:N model, i.e. multiple
servers, i.e. SMP. Apologies if this was not clear.

I think I'm close to having "SMP basics sorted" - I believe it will
take me longer now to go back to "UP" and then extend this to SMP. And
the result can probably be not as clean as having SMP there from the
beginning. Just give me another couple of days, please!

> So if we implement the bits outlined here:
> Then what is mising for full N:1 (aka UP) ?
> One thing I've considered is that we might want to add u64 timestamps
> for the various state transitions, such that userspace can compute
> elapsed time which is useful for more dynamic scheduling policies.

Tagging state transitions with unique-per-task tags (counters) will
simplify a lot of things in the userspace, as these can be used to
sort out block/wakeup/swap races easily. If these tags have timestamp
semantics (e.g. nanos), even better - as you suggest, scheduling can
be tweaked, tracing/instrumentation will naturally use these, etc.

Synchronizing state+timestamp updates will be a challenge, unless we
share a 64-bit field for state + timestamp: 6 bytes for the timestamp
and two bytes for the state (I think one byte for the state is too
tight, although it will be enough initially). 6 bytes of nanos will
cycle over in a couple of days (if my math is right), so should not be
an issue for uniqueness; and I guess the userspace can always easily
restore the higher two bytes of the timestamp if needed.

So... should I convert u32 state into u64 state+timestamp field?

> Another is preemption; I'm thinking we can drive that with signals, but
> that does give complications -- signals are super gross API wise.
> Another method would be much preferred I think. We could add another
> syscall which allows userspace (from eg. SIGALRM/SIGPROF/SIGVTALRM) to
> force a worker to do a RUNNING->RUNNABLE transition and schedule back to
> the server.

I haven't looked into preemption yet, so I'll try any approach you
suggest (after the "basics" are sorted out).

> Then lets consider N:M (aka SMP). The basics of SMP is sharing work
> between servers. For a large part userspace can already do that by
> keeping a shared ready queue. Servers that go idle pick up a work,
> re-assign it to themselves and go.
> The pain-point seems to be *efficient* BLOCKED->RUNNABLE notifications
> across servers. Inefficient options include the userspace servers
> iterating all known other servers and trying to steal their
> umcg_runnable_ptr and processing it. This is 'difficult' in that there
> is no natural wakeup and hence letting a server do IDLE will increase
> latency and destroy work concervance.
> The alternative seems to be to let the kernel do the server iteration,
> looking for a RUNNING one and using that umcg_runnable_ptr field and
> kicking it. For that we can set up an immutable linked list between
> struct umcg_task's, a circular single linked list that the kernel
> iterates until it's back where it started. Then there is no dymaic
> state.

In full utilization scenarios, which are typical in our use cases
(custom scheduling is not that useful if there are idle CPUs/servers),
the kernel will quite often find no idle servers, so having a dynamic
list/stack of idle servers is more efficient. I'll post what I have in
mind (the kernel marks servers as deleted, the userspace does
cleanup/gc) soon.