Re: [sched] 143e1e28cb4: +17.9% aim7.jobs-per-min, -9.7% hackbench.throughput

From: Peter Zijlstra
Date: Mon Aug 11 2014 - 09:34:17 EST


On Sun, Aug 10, 2014 at 06:54:13PM +0800, Fengguang Wu wrote:
> This view may be easier to read, by grouping the metrics by test case.
>
> test case: brickland1/aim7/6000-page_test

OK, I have a similar system to the brickland thing (slightly different
configuration, but should be close enough).

Now; do you have a description of each test-case someplace? In
particular, it might be good to have a small annotation to show which
direction is better.

>
> 128529 ± 1% +17.9% 151594 ± 0% TOTAL aim7.jobs-per-min

jobs per minute, + is better, so no worries there.
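
For reference, the percentage column is just the relative change between the two sample means. A minimal sketch in C, not taken from the test harness: `percent_change()`, `is_improvement()` and `enum direction` are hypothetical names, the last one illustrating the kind of "which direction is better" annotation asked for above.

```c
#include <assert.h>

/* Hypothetical per-metric annotation: which direction counts as a win. */
enum direction { HIGHER_IS_BETTER, LOWER_IS_BETTER };

/* Relative change from `before` to `after`, in percent. */
static inline double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Nonzero when the change moves in the annotated "better" direction. */
static inline int is_improvement(double before, double after,
				 enum direction dir)
{
	double pct = percent_change(before, after);

	return dir == HIGHER_IS_BETTER ? pct > 0.0 : pct < 0.0;
}
```

Feeding in the aim7.jobs-per-min means (128529 and 151594) with HIGHER_IS_BETTER reports an improvement of a bit under 18%.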

> 582269 ±14% -55.6% 258617 ±16% TOTAL softirqs.SCHED
> 993654 ± 2% -19.9% 795962 ± 3% TOTAL softirqs.RCU
> 15865125 ± 1% -15.0% 13485882 ± 1% TOTAL softirqs.TIMER

> 59366697 ± 3% -46.1% 32017187 ± 7% TOTAL cpuidle.C1-IVT.time
> 54543 ±11% -37.2% 34252 ±16% TOTAL cpuidle.C1-IVT.usage
> 19542 ± 9% -38.3% 12057 ± 4% TOTAL cpuidle.C1E-IVT.usage
> 49527464 ± 6% -32.4% 33488833 ± 4% TOTAL cpuidle.C1E-IVT.time
> 76064 ± 3% -32.2% 51572 ± 6% TOTAL cpuidle.C6-IVT.usage

Less idle time; might be good if the work is CPU-bound, might be bad if
not; hard to say.

> 2.82 ± 3% +21.9% 3.43 ± 4% TOTAL turbostat.%pc2
> 4.40 ± 2% +22.0% 5.37 ± 4% TOTAL turbostat.%c6
> 15.75 ± 1% -3.4% 15.21 ± 0% TOTAL turbostat.RAM_W

> 3150464 ± 2% -24.2% 2387551 ± 3% TOTAL time.voluntary_context_switches

Typically fewer context switches are better.

> 281 ± 1% -15.1% 238 ± 0% TOTAL time.elapsed_time
> 29294 ± 1% -14.3% 25093 ± 0% TOTAL time.system_time

Less time spent (on presumably the same work) is better.

> 4529818 ± 1% -8.8% 4129398 ± 1% TOTAL time.involuntary_context_switches

Fewer preemptions, also generally better.

> 10655 ± 0% +1.4% 10802 ± 0% TOTAL time.percent_of_cpu_this_job_got

Seems an improvement; not sure.

Many more stats... but from the above it looks like it's an overall 'win';
or am I reading the thing wrong?


Now I think I see why this is; we've reduced load balancing frequency
significantly on this machine due to:


-#define SD_SIBLING_INIT (struct sched_domain) { \
-	.min_interval = 1, \
-	.max_interval = 2, \

-#define SD_MC_INIT (struct sched_domain) { \
-	.min_interval = 1, \
-	.max_interval = 4, \

-#define SD_CPU_INIT (struct sched_domain) { \
-	.min_interval = 1, \
-	.max_interval = 4, \

	*sd = (struct sched_domain){
		.min_interval = sd_weight,
		.max_interval = 2*sd_weight,

Which increased both the min and max values significantly for all domains
involved.
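
To put numbers on that, here is a minimal standalone sketch, not kernel code, comparing the old fixed bounds with the weight-based ones. `struct lb_interval`, `old_interval()` and `weight_interval()` are hypothetical helpers; `sd_weight` is taken to be the number of CPUs spanned by the domain, and any weight passed in is an assumed value, not the measured topology of the test machine.

```c
#include <assert.h>

/* Hypothetical container for the two balance-interval bounds. */
struct lb_interval {
	unsigned long min_interval;
	unsigned long max_interval;
};

/* The three removed init macros quoted above. */
enum old_level { OLD_SIBLING, OLD_MC, OLD_CPU };

/* Old behaviour: fixed bounds per level, SIBLING 1..2, MC and CPU 1..4. */
static inline struct lb_interval old_interval(enum old_level level)
{
	struct lb_interval iv = { 1, level == OLD_SIBLING ? 2 : 4 };

	return iv;
}

/* New behaviour: both bounds scale with the CPU count of the domain. */
static inline struct lb_interval weight_interval(unsigned int sd_weight)
{
	struct lb_interval iv = { sd_weight, 2UL * sd_weight };

	assert(sd_weight > 0);
	return iv;
}
```

For an assumed 30-CPU domain, weight_interval(30) gives bounds of 30 and 60, where the old macros never went above 4.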

That said; I think we might want to do something like the below; I can
imagine decreasing load balancing too much will negatively impact other
workloads.

Maybe slightly modified to make sure the first domain has a min_interval
of 1.

---
kernel/sched/core.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1211575a2208..67ed5d854da1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6049,8 +6049,8 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	sd_flags &= ~TOPOLOGY_SD_FLAGS;

 	*sd = (struct sched_domain){
-		.min_interval = sd_weight,
-		.max_interval = 2*sd_weight,
+		.min_interval = max(1, sd_weight/2),
+		.max_interval = sd_weight,
 		.busy_factor = 32,
 		.imbalance_pct = 125,

@@ -6076,7 +6076,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 					,

 		.last_balance = jiffies,
-		.balance_interval = sd_weight,
+		.balance_interval = max(1, sd_weight/2),
 		.smt_gain = 0,
 		.max_newidle_lb_cost = 0,
 		.next_decay_max_lb_cost = jiffies,
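
The effect of the proposed bounds can be checked with a small standalone sketch, again not kernel code. `struct lb_bounds` and `proposed_interval()` are hypothetical names, and the `lowest_level` flag is only one assumed way of expressing the "first domain has a min_interval of 1" variant mentioned above; the patch itself does not contain it.

```c
#include <assert.h>

/* Hypothetical container for the two balance-interval bounds. */
struct lb_bounds {
	unsigned long min_interval;
	unsigned long max_interval;
};

/*
 * Bounds from the patch: min = max(1, sd_weight/2), max = sd_weight.
 * When lowest_level is nonzero, min is forced to 1 (assumed variant).
 */
static inline struct lb_bounds proposed_interval(unsigned int sd_weight,
						 int lowest_level)
{
	unsigned long half = sd_weight / 2;
	struct lb_bounds b;

	assert(sd_weight > 0);
	b.min_interval = (lowest_level || half < 1) ? 1 : half;
	b.max_interval = sd_weight;
	return b;
}
```

For a 2-thread sibling domain this already yields min 1 and max 2, the same as the old SD_SIBLING_INIT. The `lowest_level` flag only matters when the first domain is wider, for example an assumed 8-CPU MC level without SMT, where the plain formula would give a min of 4.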
