[PATCH 4/5] sched/fair: Consider hints in the initial task wakeup path

From: K Prateek Nayak
Date: Sat Sep 10 2022 - 06:55:41 EST


These hints influence initial task placement, biasing it towards or away
from the CPU on which the task is forked.

The flow is as follows:
- When a fork time hint is set, the NUMA biases are overlooked and only
the sched_group statistics computed by update_sg_wakeup_stats()
(number of idle CPUs and total utilization of the group) for the local
group and the idlest group are considered when making the initial task
placement decision, provided both groups have idle CPUs.
- When a bias towards the local group is hinted, go for the local group
as long as the equivalent of an idle core is present.
Note: The current implementation assumes the system running the patch
is SMT-2. Further optimizations can be made for systems with SMT-4,
SMT-8, or no SMT.
- If a hint for spreading is set and there is a tie in the number of
idle CPUs between the local group and the idlest group, use the group
utilization as the tie-breaking metric.

PR_SCHED_HINT_FORK_AFFINE enables consolidation until half of the local
group is filled. PR_SCHED_HINT_FORK_SPREAD chooses the target group
based on utilization when there is a tie in the number of idle CPUs.

These hints can be set individually in addition to wakeup hints.

- Results

Following are results from using individual fork time hints and
combination of fork time hints and wakeup hints on various benchmark on
a dual socket Zen3 system:

o Only fork time hint:

- Hackbench

Test:            tip                  no-hint              fork_affine          fork_spread
 1-groups:      4.31 (0.00 pct)      4.46 (-3.48 pct)     4.27 (0.92 pct)      4.28 (0.69 pct)
 2-groups:      4.93 (0.00 pct)      4.85 (1.62 pct)      4.91 (0.40 pct)      5.15 (-4.46 pct)
 4-groups:      5.38 (0.00 pct)      5.35 (0.55 pct)      5.36 (0.37 pct)      5.31 (1.30 pct)
 8-groups:      5.59 (0.00 pct)      5.49 (1.78 pct)      5.51 (1.43 pct)      5.51 (1.43 pct)
16-groups:      7.18 (0.00 pct)      7.38 (-2.78 pct)     7.31 (-1.81 pct)     7.25 (-0.97 pct)

- schbench

workers:        tip                     no-hint                 fork_affine
  1:          37.00 (0.00 pct)        38.00 (-2.70 pct)       17.00 (54.05 pct)
  2:          39.00 (0.00 pct)        36.00 (7.69 pct)        21.00 (46.15 pct)
  4:          41.00 (0.00 pct)        41.00 (0.00 pct)        28.00 (31.70 pct)
  8:          53.00 (0.00 pct)        54.00 (-1.88 pct)       39.00 (26.41 pct)
 16:          73.00 (0.00 pct)        74.00 (-1.36 pct)       68.00 (6.84 pct)
 32:         116.00 (0.00 pct)       124.00 (-6.89 pct)      113.00 (2.58 pct)
 64:         217.00 (0.00 pct)       215.00 (0.92 pct)       205.00 (5.52 pct)
128:         477.00 (0.00 pct)       440.00 (7.75 pct)       445.00 (6.70 pct)
256:        1062.00 (0.00 pct)      1026.00 (3.38 pct)      1007.00 (5.17 pct)
512:       47552.00 (0.00 pct)     47168.00 (0.80 pct)     47296.00 (0.53 pct)

- tbench

Clients:        tip                     no-hint                 fork_affine               fork_spread
   1          573.26 (0.00 pct)       572.29 (-0.16 pct)      572.70 (-0.09 pct)       569.64 (-0.63 pct)
   2         1131.19 (0.00 pct)      1119.57 (-1.02 pct)     1131.97 (0.06 pct)       1101.03 (-2.66 pct)
   4         2100.07 (0.00 pct)      2070.66 (-1.40 pct)     2094.80 (-0.25 pct)      2011.64 (-4.21 pct)
   8         3809.88 (0.00 pct)      3784.16 (-0.67 pct)     3458.94 (-9.21 pct)      3867.70 (1.51 pct)
  16         6560.72 (0.00 pct)      6449.64 (-1.69 pct)     6342.78 (-3.32 pct)      6700.50 (2.13 pct)
  32        12203.23 (0.00 pct)     12180.02 (-0.19 pct)    10411.44 (-14.68 pct)    13104.29 (7.38 pct)
  64        22389.81 (0.00 pct)     23084.51 (3.10 pct)     16614.14 (-25.79 pct)    24353.76 (8.77 pct)
 128        32449.37 (0.00 pct)     33561.28 (3.42 pct)     19971.67 (-38.45 pct)    36201.16 (11.56 pct)
 256        58962.40 (0.00 pct)     59118.43 (0.26 pct)     26836.13 (-54.48 pct)    61721.06 (4.67 pct)
 512        59608.71 (0.00 pct)     60246.78 (1.07 pct)     36889.55 (-38.11 pct)    59696.57 (0.14 pct)
1024        58037.02 (0.00 pct)     58532.41 (0.85 pct)     39936.06 (-31.18 pct)    57445.62 (-1.01 pct)

These benchmarks show that even a slightly different initial placement
can make a noticeable difference: a placement in line with the
benchmark's behavior improves its results, while a mismatched hint can
regress them, as the tbench fork_affine column shows.

o Combination of hints

- Hackbench

Test:            tip                  no-hint              fork_affine + wake_affine    fork_spread + wake_hold
 1-groups:      4.31 (0.00 pct)      4.46 (-3.48 pct)     4.20 (2.55 pct)              4.81 (-11.60 pct)
 2-groups:      4.93 (0.00 pct)      4.85 (1.62 pct)      4.74 (3.85 pct)              5.09 (-3.24 pct)
 4-groups:      5.38 (0.00 pct)      5.35 (0.55 pct)      5.01 (6.87 pct)              5.62 (-4.46 pct)
 8-groups:      5.59 (0.00 pct)      5.49 (1.78 pct)      5.38 (3.75 pct)              5.69 (-1.78 pct)
16-groups:      7.18 (0.00 pct)      7.38 (-2.78 pct)     7.25 (-0.97 pct)             7.97 (-11.00 pct)

Hackbench improves further when the correct wakeup hint is paired with
the correct fork time hint. The regression is equally severe when the
wrong hints are set.

Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90e523cd8de8..4c61bd0e93b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9262,6 +9262,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	struct sg_lb_stats local_sgs, tmp_sgs;
 	struct sg_lb_stats *sgs;
 	unsigned long imbalance;
+	unsigned int task_hint, fork_hint;
 	struct sg_lb_stats idlest_sgs = {
 			.avg_load = UINT_MAX,
 			.group_type = group_overloaded,
@@ -9365,8 +9366,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		break;

 	case group_has_spare:
+		task_hint = READ_ONCE(p->hint);
+		fork_hint = task_hint &
+			    (PR_SCHED_HINT_FORK_SPREAD | PR_SCHED_HINT_FORK_AFFINE);
 #ifdef CONFIG_NUMA
-		if (sd->flags & SD_NUMA) {
+		/*
+		 * If a hint is set, override any NUMA preference behavior.
+		 */
+		if ((sd->flags & SD_NUMA) && !fork_hint) {
 			int imb_numa_nr = sd->imb_numa_nr;
 #ifdef CONFIG_NUMA_BALANCING
 			int idlest_cpu;
@@ -9406,14 +9413,37 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		}
 #endif /* CONFIG_NUMA */

+		/*
+		 * FIXME: Currently the system is assumed to be SMT-2
+		 * and that the number of cores in a group can be
+		 * estimated by halving the group_weight. Determine a
+		 * more generic logic for other SMT possibilities or
+		 * derive it at runtime from the topology.
+		 */
+		if ((task_hint & PR_SCHED_HINT_FORK_AFFINE) &&
+		    local_sgs.idle_cpus > local->group_weight / 2)
+			return NULL;
 		/*
 		 * Select group with highest number of idle CPUs. We could also
 		 * compare the utilization which is more stable but it can end
 		 * up that the group has less spare capacity but finally more
 		 * idle CPUs which means more opportunity to run task.
 		 */
-		if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus)
+		if (local_sgs.idle_cpus > idlest_sgs.idle_cpus)
+			return NULL;
+
+		if (local_sgs.idle_cpus == idlest_sgs.idle_cpus) {
+			/*
+			 * In case of a tie between number of idle CPUs and if
+			 * the task hints a benefit from spreading, go with the
+			 * group with the lesser utilization.
+			 */
+			if ((task_hint & PR_SCHED_HINT_FORK_SPREAD) &&
+			    local_sgs.group_util > idlest_sgs.group_util)
+				return idlest;
+
 			return NULL;
+		}
 		break;
 	}

--
2.25.1