[PATCH 3/5] sched/fair: Add support for hints in the subsequent wakeup path

From: K Prateek Nayak
Date: Sat Sep 10 2022 - 06:55:26 EST


Hints are adhered to as long as there are idle cores in the target MC
domain. Beyond that, the default behavior is followed.

- Hinting flow in the wakeup path

Following is the flow with wakeup hints:

o Check if the task has a wakeup hint set and whether the current
CPU and the CPU where the task previously ran are on two different
LLCs. If either is false, bail out and follow the default logic.
o Check whether the previous CPU or the current CPU is the desired
CPU according to the set hint.
o Test for idle cores in the MC domain of the hinted CPU.
o If yes, set the desired CPU as the target for wakeup. The scheduler
will then look for an idle CPU within the MC domain of the target.
o If test_idle_cores() returns false, follow the default wakeup path
(a condensed sketch of this flow follows the list).
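
Condensed, the hinted part of the flow looks roughly as follows. This
is only a sketch mirroring the fair.c hunk at the end of this mail,
not a separate implementation:

	/* On the WF_TTWU path of select_task_rq_fair() */
	if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {
		/* WAKE_HOLD prefers prev_cpu, WAKE_AFFINE the waker's CPU */
		int target_cpu = (wakeup_hint & PR_SCHED_HINT_WAKE_HOLD) ?
				 prev_cpu : cpu;

		if (test_idle_cores(target_cpu, false))
			new_cpu = target_cpu;
		/* else: fall through to the default wakeup path */
	}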

PR_SCHED_HINT_WAKE_AFFINE will favor an affine wakeup if the MC domain
where the waker is running advertises an idle core.
PR_SCHED_HINT_WAKE_HOLD will bias the wakeup towards the MC domain
where the task previously ran.
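
From userspace, a task would set these hints via prctl(). Below is a
minimal sketch; the PR_SCHED_HINT command name, its value, and the
hint bit values are placeholders assumed from the interface patch
earlier in this series and may differ from the actual interface:

	#include <stdio.h>
	#include <sys/prctl.h>

	/* Placeholder values; the real ones come from the series. */
	#ifndef PR_SCHED_HINT
	#define PR_SCHED_HINT			65
	#define PR_SCHED_HINT_WAKE_AFFINE	(1U << 0)
	#define PR_SCHED_HINT_WAKE_HOLD		(1U << 1)
	#endif

	int main(void)
	{
		/* Favor affine wakeups for this task from here on. */
		if (prctl(PR_SCHED_HINT, PR_SCHED_HINT_WAKE_AFFINE, 0, 0, 0))
			perror("prctl(PR_SCHED_HINT)");

		/* ... wakeup-heavy phase of the workload ... */
		return 0;
	}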

- Results

Following are results from running hackbench, with only the wakeup
hints set, on a dual-socket Zen3 system in NPS1 mode (time in seconds;
lower is better):

o Hackbench

Test:             tip                no-hint             wake_affine          wake_hold
 1-groups:   4.31 (0.00 pct)    4.46 (-3.48 pct)     4.20 (2.55 pct)      4.11 (4.64 pct)
 2-groups:   4.93 (0.00 pct)    4.85 (1.62 pct)      4.74 (3.85 pct)      5.15 (-4.46 pct)
 4-groups:   5.38 (0.00 pct)    5.35 (0.55 pct)      5.04 (6.31 pct)      4.54 (15.61 pct)
 8-groups:   5.59 (0.00 pct)    5.49 (1.78 pct)      5.39 (3.57 pct)      5.71 (-2.14 pct)
16-groups:   7.18 (0.00 pct)    7.38 (-2.78 pct)     7.24 (-0.83 pct)     7.76 (-8.07 pct)

As we can observe, PR_SCHED_HINT_WAKE_AFFINE helps performance in all
hackbench configurations except 16-groups, where it shows a marginal
(-0.83 pct) regression. PR_SCHED_HINT_WAKE_HOLD shows no consistent
trend and can lead to unpredictable results in hackbench.

- Shortcomings

In schbench, the delay in indicating that no idle core is available in
the target MC domain leads to a pileup and a severe degradation in p99
latency.
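
For context, the indicator queried above is the has_idle_cores flag in
the LLC's sched_domain_shared, read via test_idle_cores() (existing
helper in fair.c on this series' base, abridged here):

	static inline bool test_idle_cores(int cpu, bool def)
	{
		struct sched_domain_shared *sds;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (sds)
			return READ_ONCE(sds->has_idle_cores);

		return def;
	}

has_idle_cores is only cleared once a full scan in select_idle_cpu()
fails to find an idle core, so a hinted LLC can keep advertising idle
cores for a while after the last one has become busy, and hinted
wakeups keep piling onto it: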

o schbench (p99 latency in usec; lower is better)

workers:        tip                  no-hint                wake_affine               wake_hold
   1:       37.00 (0.00 pct)     38.00 (-2.70 pct)        18.00 (51.35 pct)        32.00 (13.51 pct)
   2:       39.00 (0.00 pct)     36.00 (7.69 pct)         18.00 (53.84 pct)        36.00 (7.69 pct)
   4:       41.00 (0.00 pct)     41.00 (0.00 pct)         21.00 (48.78 pct)        33.00 (19.51 pct)
   8:       53.00 (0.00 pct)     54.00 (-1.88 pct)        31.00 (41.50 pct)        51.00 (3.77 pct)
  16:       73.00 (0.00 pct)     74.00 (-1.36 pct)      2636.00 (-3510.95 pct)     75.00 (-2.73 pct)
  32:      116.00 (0.00 pct)    124.00 (-6.89 pct)     15696.00 (-13431.03 pct)   124.00 (-6.89 pct)
  64:      217.00 (0.00 pct)    215.00 (0.92 pct)      15280.00 (-6941.47 pct)    224.00 (-3.22 pct)
 128:      477.00 (0.00 pct)    440.00 (7.75 pct)      14800.00 (-3002.72 pct)    493.00 (-3.35 pct)
 256:     1062.00 (0.00 pct)   1026.00 (3.38 pct)      15696.00 (-1377.96 pct)   1026.00 (3.38 pct)
 512:    47552.00 (0.00 pct)  47168.00 (0.80 pct)      60736.00 (-27.72 pct)    49856.00 (-4.84 pct)

Wake hold still seems to do well at lower worker counts by reducing
the larger latency samples that we otherwise observe during task
migration.

- Potential Solution

One potential solution is to atomically read the nr_busy_cpus member
of the sched_domain_shared struct, but the performance impact of doing
so in the wakeup path is yet to be evaluated.
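
A rough sketch of what such a check could look like is given below.
The helper name llc_has_spare_capacity() is made up for illustration;
sd_llc_shared, sd_llc_size, and sched_domain_shared::nr_busy_cpus
already exist in the scheduler:

	/* Caller must hold rcu_read_lock(), as the wakeup path already does. */
	static inline bool llc_has_spare_capacity(int cpu)
	{
		struct sched_domain_shared *sds;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (!sds)
			return false;

		/* Fewer busy CPUs than CPUs in the LLC implies an idle CPU. */
		return atomic_read(&sds->nr_busy_cpus) < per_cpu(sd_llc_size, cpu);
	}

This trades the lazily updated has_idle_cores flag for an atomic read
in the wakeup fast path; whether that read is cheap enough there is
exactly what still needs to be measured.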

Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efceb670e755..90e523cd8de8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -51,6 +51,8 @@

 #include <linux/sched/cond_resched.h>
 
+#include <uapi/linux/prctl.h>
+
 #include "sched.h"
 #include "stats.h"
 #include "autogroup.h"
@@ -7031,6 +7033,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
+	bool use_hint = false;
+	unsigned int task_hint = READ_ONCE(p->hint);
+	unsigned int wakeup_hint = task_hint &
+			(PR_SCHED_HINT_WAKE_AFFINE | PR_SCHED_HINT_WAKE_HOLD);
 
 	/*
 	 * required for stable ->cpus_allowed
@@ -7046,6 +7052,37 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			new_cpu = prev_cpu;
 		}
 
+		/*
+		 * Handle the case where a hint is set and the current CPU
+		 * and the previous CPU where the task ran don't share a cache.
+		 */
+		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {
+			/*
+			 * Start by assuming the hint is PR_SCHED_HINT_WAKE_AFFINE
+			 * and set target_cpu to the current CPU.
+			 */
+			int target_cpu = cpu;
+
+			/*
+			 * If the hint is PR_SCHED_HINT_WAKE_HOLD,
+			 * change target_cpu to prev_cpu.
+			 */
+
+			if (wakeup_hint & PR_SCHED_HINT_WAKE_HOLD)
+				target_cpu = prev_cpu;
+
+			/*
+			 * A wakeup hint is set: try to bias the
+			 * task placement towards the hinted CPU
+			 * as long as there is an idle core in
+			 * the targeted LLC.
+			 */
+			if (test_idle_cores(target_cpu, false)) {
+				use_hint = true;
+				new_cpu = target_cpu;
+			}
+		}
+
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}

@@ -7057,7 +7094,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		 */
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
-			if (cpu != prev_cpu)
+			/*
+			 * If it was decided to follow the hint,
+			 * do not re-evaluate the target CPU.
+			 */
+			if (cpu != prev_cpu && !use_hint)
 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
 
 			sd = NULL; /* Prefer wake_affine over balance flags */
--
2.25.1