Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice

From: Mateusz Guzik
Date: Wed Oct 25 2023 - 12:24:52 EST


On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
> On 2023-10-25 10:31, Steven Rostedt wrote:
> > On Wed, 25 Oct 2023 15:55:45 +0200
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> [...]
>
> After digging lore for context, here are some thoughts about the actual
> proposal: AFAIU the intent here is to boost the scheduling slice for a
> userspace thread running with a mutex held so it can complete faster,
> and therefore reduce contention.
>
> I suspect this is not completely unrelated to priority inheritance
> futexes, except that one goal stated by Steven is to increase the
> owner's slice without requiring a system call on the fast path.
>
> Compared to PI futexes, I think Steven's proposal misses the part
> where a thread waiting on a futex boosts the lock owner's priority
> so it can complete faster. By making the lock owner selfishly claim
> that it needs a larger scheduling slice, it opens the door to
> scheduler disruption, and it's hard to come up with upper bounds
> that work for all cases.
>
> Hopefully I'm not oversimplifying if I state that we have mainly two
> actors to consider:
>
> [A] the lock owner thread
>
> [B] threads that block trying to acquire the lock
>
> The fast path here is [A]. [B] can go through a system call; I don't
> think it matters at all.
>
> So perhaps we can extend the rseq per-thread area with a field that
> implements a "held locks" list that allows [A] to let the kernel know
> that it is currently holding a set of locks (those can be chained when
> locks are nested). It would be updated on lock/unlock with just a few
> stores in userspace.
>
> Those lock addresses could then be used as keys for private locks,
> or transformed into inode/offset keys for shared-memory locks. Threads
> [B] blocking trying to acquire the lock can call a system call which
> would boost the lock owner's slice and/or priority for a given lock key.
>
> When the scheduler preempts [A], it would check whether the rseq
> per-thread area has a "held locks" field set, look up the
> slice/priority boosts currently active for each of those locks, and
> boost the task's slice/priority accordingly.
>
> A scheme like this should allow lock priority inheritance without
> requiring system calls on the userspace lock/unlock fast path.
>
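
Just to make sure I'm reading the proposal right: I take it the
userspace side boils down to something like the below, i.e. a couple of
plain stores into a kernel-visible per-thread area (layout and names
made up here for illustration, not a proposed ABI):

/*
 * Imagined extension of the rseq area; the only point is that
 * lock/unlock stay syscall-free.
 */
#include <stdint.h>

#define HELD_LOCKS_MAX 16

struct rseq_held_locks {
	uint32_t nr;                   /* number of locks currently held */
	uint32_t pad;
	uint64_t keys[HELD_LOCKS_MAX]; /* lock address, or inode/offset key */
};

/* Per-thread pointer, registered with the kernel once at thread start. */
extern __thread struct rseq_held_locks *my_held_locks;

static inline void held_locks_push(void *lock)
{
	struct rseq_held_locks *hl = my_held_locks;

	/* A real implementation would bound-check nr against HELD_LOCKS_MAX. */
	hl->keys[hl->nr] = (uintptr_t)lock;
	/* Publish the key before the count the kernel reads on preemption. */
	__atomic_store_n(&hl->nr, hl->nr + 1, __ATOMIC_RELEASE);
}

static inline void held_locks_pop(void)
{
	struct rseq_held_locks *hl = my_held_locks;

	__atomic_store_n(&hl->nr, hl->nr - 1, __ATOMIC_RELEASE);
}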

I think both this proposal and the original in this thread open a can
of worms, and I don't think going down that road has been properly
justified. A proper justification would demonstrate a big enough(tm)
improvement over a locking primitive with adaptive spinning.

It is well known that what mostly shafts the performance of regular
userspace locking is all the nastiness of going off CPU to wait.

The original benchmark with slice extension disabled keeps the CPUs
busy, virtually guaranteeing these threads will keep getting preempted,
some of the time while holding the lock. When that happens, all other
threads which happen to need the lock actively waste time.

Adaptive spinning was already mentioned elsewhere in the thread, and
the idea itself is at least two decades old. If anything, I find it
strange that it did not land years ago.

I see there is a preliminary patch by you which exports the owner's
scheduling state, so one can spin nicely without even going to the
kernel:
https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@xxxxxxxxxxxx/
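
For what it's worth, the consumer side of that is tiny. Assuming the
patch ends up exposing some "owner is on CPU" bit readable from
userspace, the spin decision is roughly the below (field and flag names
are placeholders, not copied from the patch):

#include <stdatomic.h>
#include <stdint.h>

struct sched_state_view {              /* stand-in for the exported rseq field */
	_Atomic uint32_t state;
};
#define SCHED_STATE_ON_CPU	(1u << 0)  /* placeholder flag */

/* Spin only while the owner is actually running; otherwise take the
 * futex slow path. */
static inline int worth_spinning(struct sched_state_view *owner)
{
	return atomic_load_explicit(&owner->state, memory_order_relaxed)
	       & SCHED_STATE_ON_CPU;
}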

To be clear, I think a locking primitive which can do adaptive spinning
*and* futexes *and* not get preempted while holding locks is the fastest
option. What is not clear to me is whether it is sufficiently faster
than adaptive spinning and futexes alone.
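
A rough sketch of what I have in mind, with every interface name made
up: "owner_on_cpu" stands in for the rseq-exported scheduling state
(and lives in the mutex only to keep the example self-contained), and
"slice_extend" stands in for whatever ABI a slice-extension mechanism
might end up with; neither is a real interface today.

#include <stdatomic.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct hybrid_mutex {
	_Atomic uint32_t state;        /* 0 free, 1 locked, 2 locked + waiters */
	_Atomic uint32_t owner_on_cpu; /* stand-in for exported sched state */
};

static _Thread_local _Atomic uint32_t slice_extend; /* imagined kernel-visible flag */

static void futex_wait(_Atomic uint32_t *uaddr, uint32_t val)
{
	syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

void hybrid_lock(struct hybrid_mutex *m)
{
	uint32_t expected = 0;

	while (!atomic_compare_exchange_weak(&m->state, &expected, 1)) {
		if (expected != 0 && atomic_load(&m->owner_on_cpu)) {
			/* Owner is running: spinning should pay off
			 * (a cpu_relax/pause would go here). */
			expected = 0;
			continue;
		}
		/* Owner is off CPU: flag contention and sleep in the kernel. */
		if (atomic_exchange(&m->state, 2) == 0)
			break;                 /* raced to the lock after all */
		futex_wait(&m->state, 2);
		expected = 0;
	}
	atomic_store(&m->owner_on_cpu, 1);
	atomic_store(&slice_extend, 1);        /* "please don't preempt me now" */
}

Unlock would clear slice_extend again, set state back to 0 and
FUTEX_WAKE one waiter if it observed 2.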

tl;dr perhaps someone(tm) could carry the above to a state where it can
be benchmarked vs the original patch