[RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers
Date: Tue Sep 05 2023 - 13:52:29 EST


Rate limit migrations to 1 migration per 2 milliseconds per task. On a
kernel with the EEVDF scheduler (commit
b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench from
62s to 45s on a 192-core AMD EPYC system (across 2 sockets).

This results in the following benchmark improvements:

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 62s to 45s (27% speedup)

And similarly with perf bench:

perf bench sched messaging -g 32 -p -t -l 100000

from 13.0s to 9.5s (26% speedup)

I have noticed that in order to observe the speedup, the workload needs
to keep the CPUs sufficiently busy to cause runqueue lock contention,
but not so busy that they don't go idle. This can be explained by the
fact that idle CPUs are a preferred target for task wakeup runqueue
selection, and therefore having idle CPUs causes more migrations, which
triggers more remote wakeups. For both the hackbench and the perf bench
sched messaging benchmarks, the scale of the workload can be tweaked by
changing the number of groups.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000 cpu_relax() calls within do_idle() had
similar benefits.

This patch results from the observation that the common effect of the
prior approaches that succeeded in speeding up this workload was to
reduce the migration rate from 7.5k migrations/s to 1.5k migrations/s.

This patch shows similar speedup on a 6.4.4 kernel with the CFS
scheduler.

With this patch applied, the "skip queued wakeups only when L2 is
shared" patch [1] brings the hackbench benchmark to 41s (34% speedup
from baseline), but the "ratelimit update to tg->load_avg" patch
from Aaron Lu [2] does not seem to offer any speedup.

The limit of 1 migration and the 2ms window size were determined
empirically with the hackbench benchmark on the targeted hardware.

I would be interested to hear feedback about performance impact of this
patch (improvement or regression) on other workloads and hardware,
especially for Intel CPUs.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@xxxxxxx/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@xxxxxxxxxxxx/ [1]
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@xxxxxxxxx/ [2]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
Cc: Julien Desfossez <jdesfossez@xxxxxxxxxxxxxxxx>
Cc: x86@xxxxxxxxxx
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 14 ++++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..1111d04255cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -564,6 +564,8 @@ struct sched_entity {

 	u64 nr_migrations;

+	u64 next_migration_time;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int depth;
 	struct sched_entity *parent;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 479db611f46e..0d294fce261d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
 	p->se.slice = sysctl_sched_base_slice;
+	p->se.next_migration_time = 0;
 	INIT_LIST_HEAD(&p->se.group_node);

 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d92da2d78774..24ac69913005 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -960,6 +960,14 @@ int sched_update_scaling(void)

 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);

+static bool should_migrate_task(struct task_struct *p, int prev_cpu)
+{
+	/* Rate limit task migration. */
+	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
+		return false;
+	return true;
+}
+
 /*
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
@@ -7897,6 +7905,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}

+	if (want_affine && !should_migrate_task(p, prev_cpu))
+		return prev_cpu;
+
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
@@ -7944,6 +7955,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;

+	/* Rate limit task migration. */
+	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
+
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf54fe338e23..c9b1a5976761 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,8 @@ struct cpuidle_state;
 #define TASK_ON_RQ_QUEUED 1
 #define TASK_ON_RQ_MIGRATING 2

+#define SCHED_MIGRATION_RATELIMIT_WINDOW 2000000 /* 2 ms */
+
 extern __read_mostly int scheduler_running;

 extern unsigned long calc_load_update;
--
2.39.2