[PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
From: Valentin Schneider
Date: Thu Apr 15 2021 - 13:59:05 EST
Consider the following topology:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})

w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just before
this load_balance() happens and preempt the CPU hog running there, we would
have, for the [0-1] group at CPU2's DIE level:
o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
IOW, this group is group_overloaded.
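
To make the two conditions above concrete, here is a minimal user-space
sketch (not kernel code). struct sgs_sketch and group_overloaded_sketch()
are hypothetical names; the little-CPU capacity of 446 and the
imbalance_pct of 117 used in the note below are assumed values for
illustration, and the in-kernel classification checks more than these two
conditions.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the kernel's struct sg_lb_stats. */
struct sgs_sketch {
	unsigned long group_capacity;	/* sum of CPU capacities in the group */
	unsigned long group_util;	/* sum of CPU utilization in the group */
	unsigned int sum_nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* Only the two conditions listed above; this is an illustration. */
static inline bool group_overloaded_sketch(const struct sgs_sketch *sgs,
					   unsigned int imbalance_pct)
{
	/* More runnable tasks than CPUs is a prerequisite. */
	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	/* Utilization, scaled by imbalance_pct, exceeds the capacity. */
	return sgs->group_capacity * 100 < sgs->group_util * imbalance_pct;
}
```

With the two hogs plus the freshly woken kworker (sum_nr_running = 3,
group_weight = 2) and an assumed group_capacity = group_util = 892, the
helper returns true; without the kworker (sum_nr_running = 2) it returns
false.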
Assuming CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.
Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.
Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.
Signed-off-by: Valentin Schneider <valentin.schneider@xxxxxxx>
---
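Not part of the patch: a standalone user-space sketch of the decision table
in migrate_degrades_capacity() and of the override applied in
can_migrate_task(). All *_sketch names are hypothetical; the ~20% fitness
margin and the ~5% capacity-comparison margin are assumptions standing in
for task_fits_capacity() and cpu_capacity_greater(), and the capacities 446
and 1024 mentioned below are illustrative values.

```c
#include <stdbool.h>

/* Assumed stand-in for task_fits_capacity(): fits below ~80% of capacity. */
static inline bool fits_sketch(unsigned long task_util, unsigned long capacity)
{
	return task_util * 1280 < capacity * 1024;
}

/* Assumed stand-in for cpu_capacity_greater(): a exceeds b by over ~5%. */
static inline bool capacity_greater_sketch(unsigned long a, unsigned long b)
{
	return a * 1024 > b * 1078;
}

/*
 * Mirrors the three-way result documented in the patch:
 *   1: the task needs more capacity than the dst CPU can provide
 *   0: the task needs the extra capacity provided by the dst CPU
 *  -1: the task isn't impacted by the migration wrt capacity
 */
static inline int degrades_capacity_sketch(bool asym, unsigned long task_util,
					   unsigned long src_cap,
					   unsigned long dst_cap)
{
	if (!asym)
		return -1;

	/* Misfit on the source CPU: only the capacity ordering matters. */
	if (!fits_sketch(task_util, src_cap)) {
		if (capacity_greater_sketch(dst_cap, src_cap))
			return 0;
		if (capacity_greater_sketch(src_cap, dst_cap))
			return 1;
		return -1;
	}

	/* Fits on the source CPU: check it would still fit on the dst CPU. */
	return fits_sketch(task_util, dst_cap) ? -1 : 1;
}

/* The cache-hot verdict is overridden only when both conditions hold. */
static inline bool overrides_task_hot_sketch(bool dst_has_spare, int degrades)
{
	return dst_has_spare && degrades == 0;
}
```

For a CPU hog (util 1024) moving from an assumed capacity-446 CPU to a
capacity-1024 CPU, degrades_capacity_sketch() returns 0, so a destination
group with spare capacity may pull it even if task_hot() says it is cache
hot. The reverse direction returns 1 and leaves the task_hot() result
untouched.
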
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2d1a69d7aa7..43fc98d34276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7493,6 +7493,7 @@ struct lb_env {
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
 	enum group_type		src_grp_type;
+	enum group_type		dst_grp_type;
 	struct list_head	tasks;
 };
@@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
+
+/*
+ * What does migrating this task do to our capacity-aware scheduling criterion?
+ *
+ * Returns 1, if the task needs more capacity than the dst CPU can provide.
+ * Returns 0, if the task needs the extra capacity provided by the dst CPU.
+ * Returns -1, if the task isn't impacted by the migration wrt capacity.
+ */
+static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
+{
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		return -1;
+
+	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
+		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+			return 0;
+		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
+			return 1;
+		else
+			return -1;
+	}
+
+	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns 1, if task migration degrades locality
@@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
+	/*
+	 * On a (sane) asymmetric CPU capacity system, the increase in compute
+	 * capacity should offset any potential performance hit caused by a
+	 * migration.
+	 */
+	if ((env->dst_grp_type == group_has_spare) &&
+	    !migrate_degrades_capacity(p, env))
+		tsk_cache_hot = 0;
+
 	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 		if (tsk_cache_hot == 1) {
@@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (!sds.busiest)
 		goto out_balanced;
+	env->dst_grp_type = local->group_type;
 	env->src_grp_type = busiest->group_type;
 	/* Misfit tasks should be dealt with regardless of the avg load */
--
2.25.1