[PATCH 3/3] sched/fair: Remove magic hardcoded margin in fits_capacity()

From: Qais Yousef
Date: Mon Feb 05 2024 - 17:34:34 EST


Replace the hardcoded margin value in fits_capacity() with better dynamic
logic.

The 80% threshold (i.e. a ~20% margin) is a magic value that has served
its purpose for now, but it no longer fits the variety of systems that
exist today. If a system's big cores are overpowered in particular, this
80% means we leave a lot of capacity unused before we decide to
upmigrate on an HMP system.

On many systems the little cores are underpowered, and the ability to
migrate away from them faster is desired.

There are two factors that must define when a task is misfit:

1. Due to invariance, time stretches and a task needs a longer runtime
to reach a given util value on smaller CPUs.

This means tasks running on the biggest core require a shorter amount
of time to reach the max performance point compared to tasks on the
smaller cores. To counter the impact of this invariance, we must ensure
that tasks migrate faster so that they can reach the max performance
point at a constant rate regardless of where they're running.

2. Misfit migration relies on the tick to trigger it. In the worst case
this tick can happen just after we cross the fits_capacity threshold.
So we must cater for this worst case when calculating the threshold so
tasks don't end up stuck on the wrong CPU, which can cause performance
and latency problems.
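As a rough illustration of factor 1, a continuous-time sketch of the
PELT ramp (a simplification: the kernel uses discrete integer sums, and
ramp_ms()/wall_ms() below are hypothetical helpers, not kernel code):

```python
import math

# Continuous-time approximation of the PELT ramp: an always-running
# task's util after t ms of *invariant* runtime is 1024 * (1 - y**t),
# where the decay factor y satisfies y**32 == 0.5 (32ms half-life).
y = 0.5 ** (1 / 32)

def ramp_ms(util):
    """Invariant runtime (ms) needed to ramp util from 0 to `util`."""
    return math.log(1 - util / 1024, y)

def wall_ms(util, cap):
    """Wall-clock time on a CPU of capacity `cap`: time-stretching due
    to invariance scales the ramp by 1024/cap, so a CPU with half the
    capacity takes twice as long to reach the same util."""
    return ramp_ms(util) * 1024 / cap
```

e.g. reaching util 512 takes ~32ms of wall time on the biggest CPU but
~64ms on a capacity-512 CPU, which is why the threshold must let tasks
leave small CPUs sooner.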

The table below shows the mapping for the misfit threshold without and
with taking the tick (4ms) into account. Note how the margin needs to be
very large at the bottom end, but very small at the higher end.

The margin is still large in the 600-750 range where many mid cores lie.
So there are contradictory requirements: enabling tasks to experience a
coherent ramp-up time across all CPUs and reducing the latency to reach
the max performance point, vs. slowing down to allow mid clusters to be
better utilized in 'not so busy' situations.

I am not sure the answer lies in the misfit definition. It most likely
lies in better task placement and load balancing strategies that
consider other factors besides misfit.
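For reference, the shape of the thresholds below can be reproduced with
this sketch of the computation done in update_cpu_capacity(). It uses
continuous-time PELT math, so approximate_util_avg()/approximate_runtime()
here are only stand-ins for the kernel's integer helpers introduced
earlier in this series, and the values land in the same ballpark as the
table rather than matching it exactly:

```python
import math

SCHED_CAPACITY_SCALE = 1024
TICK_USEC = 4000            # 4ms tick, i.e. HZ=250
y = 0.5 ** (1 / 32)         # PELT decay factor: y**32 == 0.5

def approximate_util_avg(util, delta_us):
    # Util of an always-running task after `delta_us` more microseconds
    # of invariant runtime (continuous-time stand-in for the kernel helper).
    t = delta_us / 1000.0
    return util * y**t + SCHED_CAPACITY_SCALE * (1 - y**t)

def approximate_runtime(util):
    # Milliseconds an always-running task needs to ramp from 0 to `util`
    # on the biggest CPU; clamp below max capacity to avoid log(0).
    util = min(util, SCHED_CAPACITY_SCALE - 1)
    return math.log(1 - util / SCHED_CAPACITY_SCALE, y)

def fits_capacity_threshold(cap):
    limit = approximate_runtime(cap) * 1000       # us until a true misfit
    limit = max(limit - TICK_USEC, 0)             # worst-case tick delay
    limit = limit * cap / SCHED_CAPACITY_SCALE    # stretch to this CPU's rate
    return approximate_util_avg(0, limit)         # util reached in that time
```

This computes the threshold-tick column: the util an always-running task
accrues on a CPU of capacity `cap` in the wall time a big CPU would need
to reach `cap`, minus one tick.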

cap threshold threshold% threshold-tick threshold-tick%
0 0 0 0 0
16 0 0 0 0
32 1 3.12 0 0
48 3 6.25 2 4.16
64 4 6.25 2 3.12
80 6 7.5 5 6.25
96 10 10.41 8 8.33
112 14 12.5 11 9.82
128 18 14.06 16 12.5
144 21 14.58 18 12.5
160 26 16.25 23 14.37
176 33 18.75 29 16.47
192 39 20.31 35 18.22
208 47 22.59 43 20.67
224 55 24.55 50 22.32
240 63 26.25 59 24.58
256 73 28.51 68 26.56
272 82 30.14 77 28.30
288 93 32.29 87 30.20
304 103 33.88 97 31.90
320 114 35.62 108 33.75
336 126 37.5 120 35.71
352 138 39.20 132 37.5
368 151 41.03 144 39.13
384 163 42.44 157 40.88
400 177 44.25 170 42.5
416 197 47.35 190 45.67
432 212 49.07 204 47.22
448 226 50.44 218 48.66
464 241 51.93 233 50.21
480 263 54.79 255 53.12
496 278 56.04 271 54.63
512 294 57.42 286 55.85
528 317 60.03 309 58.52
544 333 61.21 325 59.74
560 356 63.57 348 62.14
576 380 65.97 372 64.58
592 396 66.89 388 65.54
608 419 68.91 412 67.76
624 443 70.99 435 69.71
640 466 72.81 459 71.71
656 489 74.54 482 73.47
672 512 76.19 505 75.14
688 534 77.61 528 76.74
704 557 79.11 550 78.12
720 578 80.27 572 79.44
736 606 82.33 600 81.52
752 633 84.17 627 83.37
768 653 85.02 647 84.24
784 678 86.47 672 85.71
800 707 88.37 701 87.62
816 729 89.33 724 88.72
832 756 90.86 751 90.26
848 780 91.98 776 91.50
864 803 92.93 799 92.47
880 828 94.09 824 93.63
896 850 94.86 847 94.53
912 874 95.83 871 95.50
928 897 96.65 894 96.33
944 921 97.56 919 97.35
960 943 98.22 941 98.02
976 964 98.77 963 98.66
992 984 99.19 983 99.09
1008 1004 99.60 1004 99.60
1024 1022 99.80 1022 99.80

Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
---
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 1 +
3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..def7af257270 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10004,6 +10004,7 @@ void __init sched_init(void)
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..630eae0470ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -101,11 +101,22 @@ int __weak arch_asym_cpu_priority(int cpu)
}

/*
- * The margin used when comparing utilization with CPU capacity.
+ * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
+ * lower capacity for too long. This can happen because of two reasons:
*
- * (default: ~20%)
+ * 1. Capacity and frequency invariance means the runtime on each CPU is not
+ * the same. We want the task to experience the same ramp-up time to reach
+ * max performance point of the system as if they were running on the
+ * biggest CPU.
+ *
+ * 2. A misfit migration will require a tick as a worst case scenario to
+ * migrate it to a bigger CPU. So we must cater for this time and make sure
+ * it migrates before it becomes a true misfit.
*/
-#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, int cpu)
+{
+ return util < cpu_rq(cpu)->fits_capacity_threshold;
+}

/*
* The margin used when comparing CPU capacities.
@@ -4965,13 +4976,12 @@ static inline int util_fits_cpu(unsigned long util,
int cpu)
{
unsigned long capacity_orig, capacity_orig_thermal;
- unsigned long capacity = capacity_of(cpu);
bool fits, uclamp_max_fits;

/*
* Check if the real util fits without any uclamp boost/cap applied.
*/
- fits = fits_capacity(util, capacity);
+ fits = fits_capacity(util, cpu);

if (!uclamp_is_used())
return fits;
@@ -9522,12 +9532,30 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
+ struct rq *rq = cpu_rq(cpu);
+ u64 limit;

if (!capacity)
capacity = 1;

- cpu_rq(cpu)->cpu_capacity = capacity;
- trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+ rq->cpu_capacity = capacity;
+ trace_sched_cpu_capacity_tp(rq);
+
+ /*
+ * Calculate the util at which the task must be considered a misfit.
+ *
+ * We must ensure that a task experiences the same ramp-up time to
+ * reach max performance point of the system regardless of the CPU it
+ * is running on (due to invariance, time will stretch and task will
+ * take longer to achieve the same util value compared to a task
+ * running on a big CPU) and a delay in misfit migration which depends
+ * on TICK doesn't end up hurting it as it can happen after we would
+ * have crossed this threshold.
+ */
+ limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
+ limit -= TICK_USEC; /* sd->balance_interval is more accurate */
+ limit = cap_scale(limit, arch_scale_cpu_capacity(cpu));
+ rq->fits_capacity_threshold = approximate_util_avg(0, limit);

sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d078e6481bd0..92c29b6d3cc5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1054,6 +1054,7 @@ struct rq {
struct sched_domain __rcu *sd;

unsigned long cpu_capacity;
+ unsigned long fits_capacity_threshold;

struct balance_callback *balance_callback;

--
2.34.1