Re: [RFC PATCH 00/11] Reviving the Proxy Execution Series

From: Joel Fernandes
Date: Sun Oct 16 2022 - 22:23:15 EST


On Mon, Oct 03, 2022 at 09:44:50PM +0000, Connor O'Brien wrote:
> Proxy execution is an approach to implementing priority inheritance
> based on distinguishing between a task's scheduler context (information
> required in order to make scheduling decisions about when the task gets
> to run, such as its scheduler class and priority) and its execution
> context (information required to actually run the task, such as CPU
> affinity). With proxy execution enabled, a task p1 that blocks on a
> mutex remains on the runqueue, but its "blocked" status and the mutex on
> which it blocks are recorded. If p1 is selected to run while still
> blocked, the lock owner p2 can run "on its behalf", inheriting p1's
> scheduler context. Execution context is not inherited, meaning that
> e.g. the CPUs where p2 can run are still determined by its own affinity
> and not p1's.
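
To check my own understanding of the split, here is a toy userspace
sketch of the idea. The names (the struct task fields, pick_donor())
are mine for illustration, not the series':

#include <stdbool.h>
#include <stdio.h>

struct task {
	const char *name;
	int prio;		/* scheduler context: decides *who* runs */
	int allowed_cpu;	/* execution context: decides *where* it runs */
	bool blocked;
	struct task *lock_owner;	/* owner of the mutex we block on */
};

/* Blocked tasks stay on the runqueue and can still win the pick. */
static struct task *pick_donor(struct task **rq, int n)
{
	struct task *best = rq[0];

	for (int i = 1; i < n; i++)
		if (rq[i]->prio > best->prio)
			best = rq[i];
	return best;
}

int main(void)
{
	struct task p2 = { "p2", 1, 0, false, NULL };	/* owner  */
	struct task p1 = { "p1", 5, 1, true,  &p2 };	/* waiter */
	struct task *rq[] = { &p2, &p1 };

	struct task *donor = pick_donor(rq, 2);
	struct task *run = donor;

	while (run->blocked)		/* walk to the lock owner */
		run = run->lock_owner;

	/* p2 runs with p1's scheduler context but its own affinity. */
	printf("%s runs with %s's prio %d on CPU %d\n",
	       run->name, donor->name, donor->prio, run->allowed_cpu);
	return 0;
}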
>
> In practice a number of more complicated situations can arise: the mutex
> owner might itself be blocked on another mutex, or it could be sleeping,
> running on a different CPU, in the process of migrating between CPUs,
> etc. Details on handling these various cases can be found in patch 7/11
> ("sched: Add proxy execution"), particularly in the implementation of
> proxy() and accompanying comments.
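
To keep the cases straight while reading patch 7/11, here is the
decision table as I understand it from the description above, written
out as a self-contained C toy; the state names and actions are my own
guesses at the shape of it, not the patch's actual logic:

#include <stdio.h>

enum owner_state {
	OWNER_ON_THIS_RQ,	/* owner runnable on this CPU */
	OWNER_ON_OTHER_RQ,	/* owner busy on another CPU */
	OWNER_SLEEPING,		/* owner not runnable at all */
	OWNER_BLOCKED_TOO,	/* owner waits on yet another mutex */
};

static const char *proxy_decision(enum owner_state s)
{
	switch (s) {
	case OWNER_ON_THIS_RQ:
		return "run the owner with the donor's scheduler context";
	case OWNER_ON_OTHER_RQ:
		return "migrate the donor toward the owner's runqueue";
	case OWNER_SLEEPING:
		return "dequeue the donor; re-enqueue when the owner wakes";
	case OWNER_BLOCKED_TOO:
		return "follow the blocked_on chain to the next owner";
	}
	return "?";
}

int main(void)
{
	for (int s = OWNER_ON_THIS_RQ; s <= OWNER_BLOCKED_TOO; s++)
		printf("%d: %s\n", s, proxy_decision((enum owner_state)s));
	return 0;
}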
>
> Past discussions of proxy execution have often focused on the benefits
> for deadline scheduling. Current interest for Android is based more on
> desire for a broad solution to priority inversion on kernel mutexes,
> including among CFS tasks. One notable scenario arises when cpu cgroups
> are used to throttle less important background tasks. Priority inversion
> can occur when an "important" unthrottled task blocks on a mutex held by
> an "unimportant" task whose CPU time is constrained using cpu
> shares. The result is higher worst case latencies for the unthrottled
> task.[0] Testing by John Stultz with a simple reproducer [1] showed
> promising results for this case, with proxy execution appearing to
> eliminate the large latency spikes associated with priority
> inversion.[2]
>
> Proxy execution has been discussed over the past few years at several
> conferences[3][4][5], but (as far as I'm aware) patches implementing the
> concept have not been discussed on the list since Juri Lelli's RFC in
> 2018.[6] This series is an updated version of that patchset, seeking to
> incorporate subsequent work by Juri[7], Valentin Schneider[8] and Peter
> Zijlstra along with some new fixes.
>
> Testing so far has focused on stability, mostly via mutex locktorture
> with some tweaks to more quickly trigger proxy execution bugs. These
> locktorture changes are included at the end of the series for
> reference. The current series survives runs of >72 hours on QEMU without
> crashes, deadlocks, etc. Testing on Pixel 6 with the android-mainline
> kernel [9] yields similar results. In both cases, testing used >2 CPUs
> and CONFIG_FAIR_GROUP_SCHED=y, a configuration Valentin Schneider
> reported[10] showed stability problems with earlier versions of the
> series.
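
For anyone who wants to poke at this, mutex locktorture can be started
with the stock upstream module parameters, e.g. something like the line
below; the extra knobs added for proxy execution are in the last two
patches of the series:

	modprobe locktorture torture_type=mutex_lock nwriters_stress=4 verbose=1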
>
> That said, these are definitely still a work in progress, with some
> known remaining issues (e.g. warnings while booting on Pixel 6,
> suspicious looking min/max vruntime numbers) and likely others I haven't
> found yet. I've done my best to eliminate checks and code paths made
> redundant by new fixes but some probably remain. There's no attempt yet
> to handle core scheduling. Performance testing so far has been limited
> to the aforementioned priority inversion reproducer. The hope in sharing
> now is to revive the discussion on proxy execution and get some early
> input for continuing to revise & refine the patches.

I ran a test to check CFS time sharing. The accounting in top is confusing,
but ftrace confirms the proxying is happening.

Task A - pid 122
Task B - pid 123
Task C - pid 121
Task D - pid 124

Here D and B just spin all the time. C is the lock owner (of an in-kernel
mutex) and spins all the time, while A blocks on that same mutex and remains
blocked.
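
The test module is shaped roughly like this; a simplified sketch rather
than the exact code (thread names, error handling and the init ordering
are illustrative):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/delay.h>

static DEFINE_MUTEX(pe_mutex);
static struct task_struct *ta, *tb, *tc, *td;

static int spin_fn(void *unused)	/* B and D: pure spinners */
{
	while (!kthread_should_stop())
		cpu_relax();
	return 0;
}

static int owner_fn(void *unused)	/* C: holds the mutex, spins */
{
	mutex_lock(&pe_mutex);
	while (!kthread_should_stop())
		cpu_relax();
	mutex_unlock(&pe_mutex);
	return 0;
}

static int blocker_fn(void *unused)	/* A: sleeps in D state */
{
	mutex_lock(&pe_mutex);		/* blocks until C exits */
	mutex_unlock(&pe_mutex);
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);
	return 0;
}

static int __init pe_test_init(void)
{
	tc = kthread_run(owner_fn, NULL, "t");
	msleep(100);			/* let C grab the mutex first */
	ta = kthread_run(blocker_fn, NULL, "t");
	tb = kthread_run(spin_fn, NULL, "t");
	td = kthread_run(spin_fn, NULL, "t");
	return 0;
}

static void __exit pe_test_exit(void)
{
	kthread_stop(tc);		/* release the mutex first... */
	kthread_stop(ta);		/* ...so A can acquire it and exit */
	kthread_stop(tb);
	kthread_stop(td);
}

module_init(pe_test_init);
module_exit(pe_test_exit);
MODULE_LICENSE("GPL");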

Then I did "top -H" while the test was running which gives below output.
The first column is PID, and the third-last column is CPU percentage.

Without PE:
121 root 20 0 99496 4 0 R 33.6 0.0 0:02.76 t (task C)
123 root 20 0 99496 4 0 R 33.2 0.0 0:02.75 t (task B)
124 root 20 0 99496 4 0 R 33.2 0.0 0:02.75 t (task D)

With PE:
122 root 20 0 99496 4 0 D 25.3 0.0 0:22.21 t (task A)
121 root 20 0 99496 4 0 R 25.0 0.0 0:22.20 t (task C)
123 root 20 0 99496 4 0 R 25.0 0.0 0:22.20 t (task B)
124 root 20 0 99496 4 0 R 25.0 0.0 0:22.20 t (task D)

With PE, I was expecting 2 threads with 25% and 1 thread with 50%: the four
runnable scheduler contexts each get a 25% share, and since A's share is
spent executing C's code, charging that time to the task actually on the CPU
would show C at 50%, B and D at 25% each, and A at 0%. Instead I get 4
threads with 25% in top. Ftrace confirms that the D-state task is in fact
not running and is proxying to the owner task, so everything seems to be
working correctly, but the accounting is confusing: it is odd to see the
D-state task taking 25% CPU when it is obviously "sleeping".

Yeah, yeah, I know the D-state task (A) is proxying for C while sitting in
uninterruptible sleep, so maybe it is OK then, but I did want to bring this
up :-)

thanks,

- Joel


> [0] https://raw.githubusercontent.com/johnstultz-work/priority-inversion-demo/main/results/charts/6.0-rc7-throttling-starvation.png
> [1] https://github.com/johnstultz-work/priority-inversion-demo
> [2] https://raw.githubusercontent.com/johnstultz-work/priority-inversion-demo/main/results/charts/6.0-rc7-vanilla-vs-proxy.png
> [3] https://lpc.events/event/2/contributions/62/
> [4] https://lwn.net/Articles/793502/
> [5] https://lwn.net/Articles/820575/
> [6] https://lore.kernel.org/lkml/20181009092434.26221-1-juri.lelli@xxxxxxxxxx/
> [7] https://github.com/jlelli/linux/tree/experimental/deadline/proxy-rfc-v2
> [8] https://gitlab.arm.com/linux-arm/linux-vs/-/tree/mainline/sched/proxy-rfc-v3/
> [9] https://source.android.com/docs/core/architecture/kernel/android-common
> [10] https://lpc.events/event/7/contributions/758/attachments/585/1036/lpc20-proxy.pdf#page=4
>
> Connor O'Brien (2):
> torture: support randomized shuffling for proxy exec testing
> locktorture: support nested mutexes
>
> Juri Lelli (3):
> locking/mutex: make mutex::wait_lock irq safe
> kernel/locking: Expose mutex_owner()
> sched: Fixup task CPUs for potential proxies.
>
> Peter Zijlstra (4):
> locking/ww_mutex: Remove wakeups from under mutex::wait_lock
> locking/mutex: Rework task_struct::blocked_on
> sched: Split scheduler execution context
> sched: Add proxy execution
>
> Valentin Schneider (2):
> kernel/locking: Add p->blocked_on wrapper
> sched/rt: Fix proxy/current (push,pull)ability
>
> include/linux/mutex.h | 2 +
> include/linux/sched.h | 15 +-
> include/linux/ww_mutex.h | 3 +
> init/Kconfig | 7 +
> init/init_task.c | 1 +
> kernel/Kconfig.locks | 2 +-
> kernel/fork.c | 6 +-
> kernel/locking/locktorture.c | 20 +-
> kernel/locking/mutex-debug.c | 9 +-
> kernel/locking/mutex.c | 109 +++++-
> kernel/locking/ww_mutex.h | 31 +-
> kernel/sched/core.c | 679 +++++++++++++++++++++++++++++++++--
> kernel/sched/deadline.c | 37 +-
> kernel/sched/fair.c | 33 +-
> kernel/sched/rt.c | 63 ++--
> kernel/sched/sched.h | 42 ++-
> kernel/torture.c | 10 +-
> 17 files changed, 955 insertions(+), 114 deletions(-)
>
> --
> 2.38.0.rc1.362.ged0d419d3c-goog
>