[PATCH v2 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()
From: Valentin Schneider
Date: Wed Mar 11 2020 - 14:16:49 EST
I've been staring at select_task_rq_fair() for some time now, and have come
to "hate" the usage of the sd_flag parameter. It is used as both an
indicator of the wakeup type (ttwu/fork/exec) and as a domain walk search
target. CFS is the only class doing this, the other classes just need the
wakeup type but get passed a domain flag instead.
This series gets rid of select_task_rq()'s sd_flag parameter and also tries
to optimize select_task_rq_fair().
This is based on tip/sched/core at:
a0f03b617c3b ("sched/numa: Stop an exhastive search if a reasonable swap candidate or idle CPU is found")
Patches
=======
o Patch 1 is a simple dead parameter cleanup
o Patches 2-4 get rid of SD_LOAD_BALANCE
o Patches 5-6 involve getting rid of the sd_flag parameter for
select_task_rq().
o Patch 7 is an extra cleanup in the select_task_rq_fair() region.
o Patches 8-9 split the want_affine vs !want_affine paths of
select_task_rq_fair(), which unearths a small optimization. Sort of a
single patch split in two for the sake of review.
Testing
=======
Testing was done against a slightly older tip/sched/core at:
25ac227a25ac ("sched/fair: Remove wake_cap()")
I got annoyed by the variance in my 500 iteration runs, so I scripted
something to run batches of 5k iterations. It looks pretty stable from one
batch to another. I also stared at some boxplots to convince myself I
wasn't needlessly making things worse - you too can stare at them here [1].
Note: the 'X%' stats are the percentiles, so 50% is the 50th percentile.
Juno r0
-------
2+4 big.LITTLE. SD levels are {MC, DIE}.
Hackbench
~~~~~~~~~
15000 iterations of
$ hackbench
(lower is better):
| | -PATCH | +PATCH | DELTA |
|-------+----------+----------+--------|
| mean | 0.622235 | 0.618834 | -0.55% |
| std | 0.018121 | 0.017579 | -2.99% |
| min | 0.571000 | 0.573000 | +0.35% |
| 50% | 0.620000 | 0.617000 | -0.48% |
| 75% | 0.633000 | 0.629000 | -0.63% |
| 99% | 0.674000 | 0.669000 | -0.74% |
| max | 0.818000 | 0.756000 | -7.58% |
The boxplot shows a single outlier to the very left for both commits, which
are the minimums reported above. Other than that, +PATCH has somewhat lower
outliers on the righthand side: worst cases are a tad better.
Sysbench
~~~~~~~~
15000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=6 run
(higher is better):
| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 15318.954000 | 15628.416933 | +2.02% |
| std | 235.466202 | 205.162730 | -12.87% |
| min | 13025.000000 | 13554.000000 | +4.06% |
| 50% | 15366.000000 | 15681.000000 | +2.05% |
| 75% | 15497.000000 | 15765.000000 | +1.73% |
| 99% | 15651.000000 | 15893.000000 | +1.55% |
| max | 15716.000000 | 15972.000000 | +1.63% |
That's overall a tad better.
Dual-socket Xeon E5
-------------------
Each socket is 10 cores w/ SMT2 - 40 CPUs total. SD levels are
{SMT, MC, NUMA}.
Hackbench
~~~~~~~~~
25000 iterations of
$ hackbench -l 1000
(lower is better):
| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+--------|
| mean | 0.946312 | 0.944685 | -0.17% |
| std | 0.006419 | 0.006447 | +0.44% |
| min | 0.906000 | 0.897000 | -0.99% |
| 50% | 0.947000 | 0.945000 | -0.21% |
| 75% | 0.950000 | 0.949000 | -0.11% |
| 99% | 0.959000 | 0.958000 | -0.10% |
| max | 0.988000 | 0.967000 | -2.13% |
The boxplot shows that the min improvement is some sort of fluke - it's a
single point standing out on the left. The mean *is* slightly lowered,
which most likely comes from +PATCH having less high-value outliers.
I looked into some statistical tests, but my samples distribution isn't a
normal distribution (which is a requirement for most of them). This
actually can't happen by construction according to [2], since hackbench
outputs the maximum of a set of random of variables. We could instead use
the average of all sender/receiver pairs, or even the invidual time taken
per each pair; that being said, I don't think each value produced by a pair
could be seen as independent variables, given that there'll be more > 1
task per CPU.
Wilcoxon's test [3] gives me a p-value of ~1e-182, so there *is* a
significant difference between the two datasets, but it does not say if the
difference is in the mean, variance, or any other parameter of the
distribution.
Sysbench
~~~~~~~~~
25000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=40 run
(higher is better):
| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 23937.937560 | 24280.668640 | +1.43% |
| std | 547.139948 | 484.963639 | -11.36% |
| min | 21526.000000 | 21917.000000 | +1.82% |
| 50% | 24032.000000 | 24374.000000 | +1.42% |
| 75% | 24355.000000 | 24638.000000 | +1.16% |
| 99% | 24869.010000 | 25084.000000 | +0.86% |
| max | 25410.000000 | 25623.000000 | +0.84% |
As with the Juno, that's overall a tad better.
Takeaway
--------
The TL;DR for those fancy stats seems to be: it's hard to say much about
hackbench, and sysbench likes it a bit. The important thing for me is to
not introduce regressions with my stupid change, and AFAICT it is the case.
Links
=====
[1]: https://htmlpreview.github.io/?https://gist.githubusercontent.com/valschneider/433b3772d1776c52214dd05be2ab2f03/raw/316fbd9f774fa381c60731511c881a3360111563/streamline_v2_bplots.html
[2]: https://en.wikipedia.org/wiki/Fisher%E2%80%93Tippett%E2%80%93Gnedenko_theorem
[3]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon
Revisions
=========
v1 -> v2
--------
o Removed the 'RFC' tag
o Made the sd_flags syctl read-only
o Removed the SD_LOAD_BALANCE flag
o Cleaned up ugly changes thanks to the above
Valentin Schneider (9):
sched/fair: find_idlest_group(): Remove unused sd_flag parameter
sched/debug: Make sd->flags sysctl read-only
sched: Remove checks against SD_LOAD_BALANCE
sched/topology: Kill SD_LOAD_BALANCE
sched: Add WF_TTWU, WF_EXEC wakeup flags
sched: Kill select_task_rq()'s sd_flag parameter
sched/fair: Dissociate wakeup decisions from SD flag value
sched/fair: Split select_task_rq_fair want_affine logic
sched/topology: Define and use shortcut pointers for wakeup sd_flag
scan
include/linux/sched/topology.h | 29 +++++++------
kernel/sched/core.c | 10 ++---
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 75 +++++++++++++++++++---------------
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 4 +-
kernel/sched/sched.h | 13 ++++--
kernel/sched/stop_task.c | 2 +-
kernel/sched/topology.c | 43 +++++++++----------
10 files changed, 98 insertions(+), 86 deletions(-)
--
2.24.0