Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Julien Desfossez
Date: Wed Jun 12 2019 - 12:38:28 EST


After reading more traces and trying to understand why only untagged
tasks are starving when there are cpu-intensive tasks running on the
same set of CPUs, we noticed a difference in behavior in 'pick_task'. In
the case where 'core_cookie' is 0, we are supposed to prefer the tagged
task only if its priority is higher, but we prefer it when the
priorities are equal as well, which causes the starvation. 'pick_task'
is biased toward selecting its first parameter in case of equality,
which in this case was 'class_pick' instead of 'max'. Reversing the
order of the parameters solves this issue and matches the expected
behavior.

So we can get rid of this vruntime_boost concept.
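
To make the tie-break concrete, here is a minimal user-space sketch of
the two comparisons; prio_less() and the single vruntime field are
simplified stand-ins for the kernel's cross-cpu comparison, so treat
this as an illustration rather than the actual scheduler code:

#include <stdbool.h>
#include <stdio.h>

struct task {
        unsigned long vruntime;         /* lower vruntime == higher priority */
};

/* Returns true iff a has strictly lower priority than b. */
static bool prio_less(const struct task *a, const struct task *b)
{
        return a->vruntime > b->vruntime;
}

int main(void)
{
        struct task class_pick = { .vruntime = 100 };   /* tagged task */
        struct task max        = { .vruntime = 100 };   /* equal priority */

        /*
         * Old check: idle the sibling only if the tagged class_pick has
         * strictly lower priority than max.  On a tie this is false, so
         * the tagged task keeps winning and the untagged one starves.
         */
        bool pick_idle_old = prio_less(&class_pick, &max);

        /*
         * Fixed check: idle the sibling unless class_pick has strictly
         * higher priority than max, i.e. ties now go to max.
         */
        bool pick_idle_fixed = !prio_less(&max, &class_pick);

        /* Prints "old picks idle: 0, fixed picks idle: 1". */
        printf("old picks idle: %d, fixed picks idle: %d\n",
               pick_idle_old, pick_idle_fixed);
        return 0;
}

On a tie the old form keeps preferring the tagged class_pick, while the
fixed form returns the idle task instead, matching the intended "only
if higher priority" rule.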

We have tested the fix below and it seems to work well with
tagged/untagged tasks.

Here are our initial test results. When core scheduling is enabled,
each VM (and its associated vhost threads) is in its own cgroup/tag.
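
For reference, tagging goes through the 'cpu.tag' file that this series
adds to the cpu cgroup controller; the exact cgroup path below is
illustrative, not our production layout:

  echo 1 > /sys/fs/cgroup/cpu/vm1/cpu.tag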

One 12-vcpu VM running the MySQL TPC-C benchmark (IO + CPU), with 96
mostly-idle 1-vcpu
VMs on each NUMA node (72 logical CPUs total with SMT on):
+-------------+----------+--------------+------------+--------+
|             | baseline | coresched    | coresched  | nosmt  |
|             | no tag   | VMs tagged   | VMs tagged | no tag |
|             | v5.1.5   | no stall fix | stall fix  |        |
+-------------+----------+--------------+------------+--------+
| average TPS | 1474     | 1289         | 1264       | 1339   |
| stdev       | 48       | 12           | 17         | 24     |
| overhead    | N/A      | -12%         | -14%       | -9%    |
+-------------+----------+--------------+------------+--------+

Three 12-vcpu VMs running Linpack (cpu-intensive), all pinned on the same
NUMA node (36 logical CPUs with SMT enabled on that NUMA node):
+---------------+----------+--------------+-----------+--------+
|               | baseline | coresched    | coresched | nosmt  |
|               | no tag   | VMs tagged   | VMs tagged| no tag |
|               | v5.1.5   | no stall fix | stall fix |        |
+---------------+----------+--------------+-----------+--------+
| average gflops| 177.9    | 171.3        | 172.7     | 81.9   |
| stdev         | 2.6      | 10.6         | 6.4       | 8.1    |
| overhead      | N/A      | -3.7%        | -2.9%     | -53.9% |
+---------------+----------+--------------+-----------+--------+

This fix can be toggled dynamically with the 'CORESCHED_STALL_FIX'
sched_feature, so it's easy to test before/after (it is disabled by
default).
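
Assuming CONFIG_SCHED_DEBUG is enabled and debugfs is mounted, it can
be flipped at runtime like any other sched_feature:

  echo CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features
  echo NO_CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features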

The up-to-date git tree can also be found here in case it's easier to
follow:
https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test

Feedback welcome!

Thanks,

Julien

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e79421..26fea68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
                  * If class_pick is tagged, return it only if it has
                  * higher priority than max.
                  */
-                if (max && class_pick->core_cookie &&
-                    prio_less(class_pick, max))
+                bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
+                                     max && !prio_less(max, class_pick) :
+                                     max && prio_less(class_pick, max);
+                if (class_pick->core_cookie && max_is_higher)
                         return idle_sched_class.pick_task(rq);
 
                 return class_pick;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
  * UtilEstimation. Use estimated CPU utilization.
  */
 SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)