Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
From: Peter Oskolkov
Date: Fri Jul 09 2021 - 00:01:57 EST
On Thu, Jul 8, 2021 at 2:12 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Thu, Jul 8, 2021 at 9:46 PM Peter Oskolkov <posk@xxxxxxx> wrote:
> > Add helper functions to work atomically with userspace 32/64-bit values -
> > there are some .*futex.* named helpers, but they are not exactly
> > what is needed for UMCG; I haven't found anything else I could use, so I
> > rolled these.
> >
> > At the moment only X86_64 is supported.
> >
> > Note: the helpers should probably go into arch/ somewhere; I have
> > them in kernel/sched/umcg.h temporarily for convenience. Please
> > let me know where I should put them and how to name them.
>
> Instead of open-coding spinlocks in userspace memory like this (which
> some of the reviewers will probably dislike because it will have
> issues around priority inversion and such), I wonder whether you could
> use an actual futex as your underlying locking primitive?
>
> The most straightforward way to do that would probably be to make the
> head structure in userspace look roughly like this?
>
> struct umcg_head {
>         u64 head_ptr;
>         u32 lock;
> };
>
> and then from kernel code, you could build a fastpath that directly
> calls cmpxchg_futex_value_locked() and build a fallback based on
> do_futex(), or something like that.
>
> There is precedent for using futex from inside the kernel to
> communicate with userspace: See mm_release(), which calls do_futex()
> with FUTEX_WAKE for the clear_child_tid feature.
Hi Jann,
Thanks for the note!
The approach you suggest would require taking a lock on every
operation, I believe, while in the scheme I have, pushes/inserts are
lock-free as long as there are no concurrent pops/deletes. The kernel
does mostly pushes (waking workers, and there can be a lot of
workers), while pops are rare: there is no reason for the number of
servers to substantially exceed the number of CPUs, so idle servers
are few and any contention on them will be very short-lived; and the
userspace side pops the entire stack of idle workers in a single op
(so that lock is held only briefly as well). So I think my approach
scales better.

Priority inversion should not matter here: this is for userspace
scheduling, so it is the userspace scheduler's job to worry about it,
not the kernel's.
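To make the above concrete, here is a rough sketch of the push/pop
scheme using C11 atomics (the names and types below are made up for
illustration; the actual patch manipulates userspace memory from the
kernel via the umcg.h helpers, so this shows only the shape of the
algorithm, not the real code):

/*
 * Illustrative sketch only: idle_stack_push/idle_stack_pop_all are
 * hypothetical names, and the real kernel side goes through
 * userspace-memory helpers rather than plain C11 atomics.
 */
#include <stdatomic.h>
#include <stddef.h>

struct idle_node {
        struct idle_node *next;
        /* per-worker/server state would live here */
};

/* Lock-free push: what the kernel-side wake path would do. */
static void idle_stack_push(_Atomic(struct idle_node *) *head,
                            struct idle_node *node)
{
        struct idle_node *old = atomic_load(head);

        do {
                node->next = old;
                /* On failure, 'old' is refreshed and we retry. */
        } while (!atomic_compare_exchange_weak(head, &old, node));
}

/*
 * Detach the entire stack in one atomic op: what the userspace
 * scheduler would do when it consumes the idle list.
 */
static struct idle_node *idle_stack_pop_all(
                _Atomic(struct idle_node *) *head)
{
        return atomic_exchange(head, NULL);
}

With this shape, concurrent pushers only ever race on the head
pointer, and the consumer never delays them for longer than a single
exchange.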
Am I missing something?
Thanks,
Peter