[PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5

From: John Stultz
Date: Sat Aug 19 2023 - 02:10:02 EST


Getting this iteration of the patch series out to the list has
taken way longer than I had hoped! Right before I sent out the
v4 release, I had found that while the patch series was fairly
stable with lower cpu counts, I could very easily run into
crashes with higher cpus during bootup. Not being able to
quickly resolve the crashes I was seeing, I went ahead and sent
out v4 for review, and figured I’d debug the issue and send out
v5 as a quick follow on.

But unfortunately trying to diagnose the failures and fix them
started uncovering other crashes. Additionally, the v4 patch
series wasn’t properly bisectable as earlier changes were
missing fixes from later in the series, which made debugging
issues quite difficult. So after playing constant wack-a-mole
with bugs, I took the entire engine apart and laid all the bits
on the floor, and started slowly re-assembling things, testing
each step as I went.

This has been very laborious, as cutting small chunks from the
larger patches off would run into self-caused issues that I
spent much time debugging, due to my missing needed logic still
in the larger patch. But slowly I’ve managed to get almost all
the fine-grained patches to boot and run.

In fact, the patch series here is coarser than what I’ve
created, as there are a number of small “test” steps to help
validate behavior I changed, which would then be replaced by the
real logic afterwards. Including those here would just cause
more work for reviewers, but if you’re interested you can find
the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6-fine-grained
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6-fine-grained

Having this fine grained tree really helped me chase down and
resolve a large number of bugs in the logic. In some cases I’ve
significantly reworked parts of the changes, so for instance, I
found the previous versions of the patch had a fair amount of
logic changing the blocked_on state in the try_to_wake_up paths
which were racy particularly with the return-migration. I’ve
moved the return-migration logic into __sched so we make sure we
don’t run proxy-migrated blocked tasks on an incorrect cpu when
they are unblocked. This change has its own complexity, so
feedback/suggestions for improvements would be appreciated. But
as a result, the patch series is much more stable (particularly
the earlier components).

Now, I have run into a few issues still, particularly around the
enqueuing of tasks on deactivated/sleeping owners, particularly
with the test-ww_mutex logic (ww_mutexes have been difficult as
they break assumptions that tasks are unblocked along the
blocked_on chain in an orderly fashion - Instead a task in the
middle of the chain may become suddenly runnable because it got
an EDEADLK from its ww_mutex). Additionally I haven’t been able
to test and debug chain migration. So I’ve left those patches out
for now, but will re-include them on the next revision. Given the
number of patches here, I suspect reviewers won’t mind. :)

As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[2] by Watkins, Straub,
Niehaus, then from patches from Peter Zijlstra, extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (and thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
---------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straight forward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()). So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
that we instead run the mutex holder.

The first of these is the most difficult to grasp (I do get the
mental friction here: blocked tasks on the *run*queue sounds like
nonsense! Personally I like to think of the runqueue in this
model more like a “task-selection queue”).

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations: The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

But the functionality provided by Proxy Execution is useful, as
in Android we have a large number of cases where we are seeing
priority inversion (not unbounded, but longer than we’d like)
between “foreground” and “background” SCHED_NORMAL applications,
so having a generalized solution would be very useful.

New in v5:
---------
* Broke the patch series up into fine grained changes

* Heavily reworked the return-migration handling by moving it
out of try_to_wake_up() and into __schedule().

* Reworked blocked_on tracking logic (so it is handled
completely in mutex code), and added a blocked_on_waking
value, so we can distinguish when tasks have not acquired the
mutex, but need to wake up to try to do so.

* Resolved lost-wakeup issues caused by the wake_qs being
managed in the ctx structures instead of being on the stack. I
went back to an earlier version of that patch from Juri though
re-adding some fixes from Connor’s version.

* Resolved null se pointer crashes seen at bootup, caused by
incorrect put_prev_task() calls

* Fixes to the test-ww_mutex test logic that was causing
livelocks & hangs. (Sent as a separate series)

* And more I’ve likely forgotten

Also, I know Peter didn’t like the blocked_on wrappers, so I
dropped them last time, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.


Issues still to address:
—----------
* As mentioned above, the deactivated owner handling (where we
deactivate waiting tasks and enqueue them onto a list, then
reactivate them when the owner wakes up) has some major issues
with ww_mutexes. Additionally I think there are other races
possible, since everyone is using a single blocked_entry
list_head, anyone who wakes up may think it’s the front of the
chain and wake everyone else up(not just those waiting on it).
I believe if two processes are woken at the same time, they
could fight trying to consume and activate tasks onto different
cpus.

* Still need to validate and re-add the chain migration patches.

* Seen some rare crashes around rt runqueue accounting. Will be
investigating this.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
thought of a better name. Open to suggestions.

* As discussed at OSPM[1], I want to split pick_next_task() up
into two phases selecting and setting the next tasks, as
currently pick_next_task() assumes the returned task will be
run which results in various side-effects in sched class logic
when it’s run.

* CFS load balancing. Blocked tasks may carry forward load (PELT)
to the lock owner's CPU, so CPU may look like it is overloaded.

* I still want to push down the split scheduler and execution
context awareness further through the scheduling code, as lots
of logic still assumes there’s only a single “rq->curr” task.

* Optimization to avoid migrating blocked tasks (allowing for
optimistic spinning) if the runnable lock-owner at the end of
the blocked_on chain is already running (though this is
difficult as locking rules to traverse the blocked on chain
across run queues isn’t trivial).


Performance:
—----------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with this v5 series
and saw some average latencies grow suggesting the changes to
return-migration (and extra locking) have some impact. I’ll be
digging more there. As mentioned above, there may be some
additional optimizations that can help here, but my focus is on
getting the code working well before I spend time optimizing.

Review and feedback would be greatly appreciated!

If folks find it easier to test/tinker with, this patch series can also be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6

Thanks so much!
-john

[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/

Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
Cc: Qais Yousef <qyousef@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
Cc: Youssef Esmat <youssefesmat@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: "Paul E . McKenney" <paulmck@xxxxxxxxxx>
Cc: kernel-team@xxxxxxxxxxx

John Stultz (8):
locking/mutex: Removes wakeups from under mutex::wait_lock
locking/mutex: Split blocked_on logic into two states (blocked_on and
blocked_on_waking)
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Unnest ttwu_runnable in prep for proxy-execution
sched: Split out __sched() deactivate task logic into a helper
sched: Add a very simple proxy() function
sched: Add proxy deactivate helper
sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
locking/mutex: make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (7):
sched: Unify runtime accounting across classes
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Add task_struct::blocked_lock to serialize changes to
the blocked_on state
locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
sched: Split scheduler execution context
sched: Start blocked_on chain processing in proxy()
sched: Add blocked_donor link to task for smarter mutex handoffs

Valentin Schneider (2):
locking/mutex: Add p->blocked_on wrappers for correctness checks
sched: Fix proxy/current (push,pull)ability

include/linux/sched.h | 33 ++-
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 6 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 128 ++++++---
kernel/locking/mutex.h | 25 ++
kernel/locking/ww_mutex.h | 72 +++--
kernel/sched/core.c | 522 +++++++++++++++++++++++++++++++----
kernel/sched/deadline.c | 50 ++--
kernel/sched/fair.c | 104 +++++--
kernel/sched/rt.c | 78 +++---
kernel/sched/sched.h | 61 +++-
kernel/sched/stop_task.c | 13 +-
15 files changed, 868 insertions(+), 243 deletions(-)

--
2.42.0.rc1.204.g551eb34607-goog