[PATCH v3 00/12] sched: consolidation of cpu_power

From: Vincent Guittot
Date: Mon Jun 30 2014 - 12:06:50 EST


Part of this patchset was previously part of the larger tasks packing patchset
[1]. I have splitted the latter in 3 different patchsets (at least) to make the
thing easier.
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm

SMT system is no more the only system that can have a CPUs with an original
capacity that is different from the default value. We need to extend the use of
cpu_power_orig to all kind of platform so the scheduler will have both the
maximum capacity (cpu_power_orig/power_orig) and the current capacity
(cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power
has been created and replace arch_scale_smt_power, which is SMT specifc in the
computation of the capapcity of a CPU.

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fix load of
SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE.
This assumption generates wrong decision by creating ghost cores and by
removing real ones when the original capacity of CPUs is different from the
default SCHED_POWER_SCALE. We don't try anymore to evaluate the number of
available cores based on the group_capacity but instead we detect when the group
is fully utilized

Now that we have the original capacity of CPUS and their activity/utilization,
we can evaluate more accuratly the capacity and the level of utilization of a
group of CPUs.

This patchset mainly replaces the old capacity method by a new one and has kept
the policy almost unchanged whereas we could certainly take advantage of this
new statistic in several other places of the load balance.

Tests results:
I have put below results of 3 kind of tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform
- ebizzy with various number of threads
on 3 kernel

tip = tip/sched/core
patch = tip + this patchset
patch+irq = tip + this patchset + irq accounting

each test has been run 6 times and the figure below show the stdev and the
diff compared to the tip kernel

Dual cortex A7 tip | patch | patch+irq
stdev | diff stdev | diff stdev
hackbench (+/-)1.02% | +0.42%(+/-)1.29% | -0.65%(+/-)0.44%
scp (+/-)0.41% | +0.18%(+/-)0.10% | +78.05%(+/-)0.70%
ebizzy -t 1 (+/-)1.72% | +1.43%(+/-)1.62% | +2.58%(+/-)2.11%
ebizzy -t 2 (+/-)0.42% | +0.06%(+/-)0.45% | +1.45%(+/-)4.05%
ebizzy -t 4 (+/-)0.73% | +8.39%(+/-)13.25% | +4.25%(+/-)10.08%
ebizzy -t 6 (+/-)10.30% | +2.19%(+/-)3.59% | +0.58%(+/-)1.80%
ebizzy -t 8 (+/-)1.45% | -0.05%(+/-)2.18% | +2.53%(+/-)3.40%
ebizzy -t 10 (+/-)3.78% | -2.71%(+/-)2.79% | -3.16%(+/-)3.06%
ebizzy -t 12 (+/-)3.21% | +1.13%(+/-)2.02% | -1.13%(+/-)4.43%
ebizzy -t 14 (+/-)2.05% | +0.15%(+/-)3.47% | -2.08%(+/-)1.40%

uad cortex A15 tip | patch | patch+irq
stdev | diff stdev | diff stdev
hackbench (+/-)0.55% | -0.58%(+/-)0.90% | +0.62%(+/-)0.43%
scp (+/-)0.21% | -0.10%(+/-)0.10% | +5.70%(+/-)0.53%
ebizzy -t 1 (+/-)0.42% | -0.58%(+/-)0.48% | -0.29%(+/-)0.18%
ebizzy -t 2 (+/-)0.52% | -0.83%(+/-)0.20% | -2.07%(+/-)0.35%
ebizzy -t 4 (+/-)0.22% | -1.39%(+/-)0.49% | -1.78%(+/-)0.67%
ebizzy -t 6 (+/-)0.44% | -0.78%(+/-)0.15% | -1.79%(+/-)1.10%
ebizzy -t 8 (+/-)0.43% | +0.13%(+/-)0.92% | -0.17%(+/-)0.67%
ebizzy -t 10 (+/-)0.71% | +0.10%(+/-)0.93% | -0.36%(+/-)0.77%
ebizzy -t 12 (+/-)0.65% | -1.07%(+/-)1.13% | -1.13%(+/-)0.70%
ebizzy -t 14 (+/-)0.92% | -0.28%(+/-)1.25% | +2.84%(+/-)9.33%

I haven't been able to fully test the patchset for a SMT system to check that
the regression that has been reported by Preethi has been solved but the
various tests that i have done, don't show any regression so far.
The correction of SD_PREFER_SIBLING mode and the use of the latter at SMT level
should have fix the regression.

Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics

Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377


Vincent Guittot (12):
sched: fix imbalance flag reset
sched: remove a wake_affine condition
sched: fix avg_load computation
sched: Allow all archs to set the power_orig
ARM: topology: use new cpu_power interface
sched: add per rq cpu_power_orig
sched: test the cpu's capacity in wake affine
sched: move cfs task on a CPU with higher capacity
Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
sched: get CPU's utilization statistic
sched: replace capacity_factor by utilization
sched: add SD_PREFER_SIBLING for SMT level

arch/arm/kernel/topology.c | 4 +-
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 290 +++++++++++++++++++++++----------------------
kernel/sched/sched.h | 5 +-
4 files changed, 158 insertions(+), 144 deletions(-)

--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/