Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
From: Abel Wu
Date: Thu Feb 24 2022 - 09:36:56 EST
Hi Gautham, thanks for your comment!
On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
Hello Abel,
(+ Aubrey Li, Srikar)
On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
An CFS runqueue is considered overloaded when there are
more than one pullable non-idle tasks on it (since sched-
idle cpus are treated as idle cpus). And idle tasks are
counted towards rq->cfs.idle_h_nr_running, that is either
assigned SCHED_IDLE policy or placed under idle cgroups.
The overloaded cfs rqs can cause performance issues to
both task types:
- for latency critical tasks like SCHED_NORMAL,
time of waiting in the rq will increase and
result in higher pct99 latency, and
- batch tasks may not be able to make full use
of cpu capacity if sched-idle rq exists, thus
presents poorer throughput.
The mask of overloaded cpus is updated in periodic tick
and the idle path at the LLC domain basis. This cpumask
will also be used in SIS as a filter, improving idle cpu
searching.
This is an interesting approach to minimise the tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. This approach contrasts with the
following approaches that were previously tried :
1. Maintain the idle cpumask at the LLC level by Aubrey Li
https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@xxxxxxxxx/
It's a similar approach from different sight in SIS. Both have pros and
cons, and I couldn't tell which one is more appropriate..
While since SIS can fail in finding one idle cpu due to scaling issues,
the sched-idle load balancing might be a valuable supplement to that to
consume the idle/sched-idle cpus.
2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@xxxxxxxxxxxxxxxxxx/
The efforts done by Srikar seems focused on idle core searching, which
has a different goal from my approach I think.
The case of short running tasks pointed out by Vincent should not be a
problem in updating the overloaded cpu mask/counter, since they are not
updated either when cpu becomes busy, or when cpu frequently goes idle
during a tick period.
There have been concerns in the past about having to update the shared
mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
?
I'm afraid I didn't fully catch up with these loops, it is appreciated
if someone can shed some light, thanks!
Signed-off-by: Abel Wu <wuyun.abel@xxxxxxxxxxxxx>
---
include/linux/sched/topology.h | 10 ++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 6 ++++++
kernel/sched/topology.c | 4 +++-
5 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..03c9c81dc886 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,16 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * The above varibles are used in idle path and
+ * select_task_rq, and the following two are
+ * mainly updated in tick. They are all hot but
+ * for different usage, so start a new cacheline
+ * to avoid false sharing.
+ */
+ atomic_t nr_overloaded ____cacheline_aligned;
+ unsigned long overloaded[]; /* Must be last */
};
struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..a6da2998ec49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9423,6 +9423,7 @@ void __init sched_init(void)
rq->wake_stamp = jiffies;
rq->wake_avg_idle = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->overloaded = 0;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..0a0438c3319b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
return newidle_balance(rq, rf) != 0;
}
+
+static inline int cfs_rq_overloaded(struct rq *rq)
+{
+ return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/* Must be called with rq locked */
+static void update_overload_status(struct rq *rq)
+{
+ struct sched_domain_shared *sds;
+ int overloaded = cfs_rq_overloaded(rq);
+ int cpu = cpu_of(rq);
+
+ lockdep_assert_rq_held(rq);
+
+ if (rq->overloaded == overloaded)
+ return;
+
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (unlikely(!sds))
+ goto unlock;
+
+ if (overloaded) {
+ cpumask_set_cpu(cpu, sdo_mask(sds));
+ atomic_inc(&sds->nr_overloaded);
+ } else {
+ cpumask_clear_cpu(cpu, sdo_mask(sds));
+ atomic_dec(&sds->nr_overloaded);
+ }
+
+ rq->overloaded = overloaded;
+unlock:
+ rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
#endif /* CONFIG_SMP */
static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7315,6 +7355,8 @@ done: __maybe_unused;
if (new_tasks > 0)
goto again;
+ update_overload_status(rq);
+
So here, we are calling update_overload_status() after
newidle_balance(). If we had pulled a single task as a part of
newidle_balance(), in your current code, we do not update the overload
status. While this should get remedied in the next tick, should we
move update_overload_status(rq) prior to the new_tasks > 0 check ?
A single task won't change the overloaded status :)
And I think it would be better not do an update even if pulled several
tasks because that would break the rate limit which is undesired.
Best Regards,
Abel
--
Thanks and Regards
gautham.