[PATCH 00/13] High performance balancing logic for big.LITTLE

From: Arseniy Krasnov
Date: Fri Nov 06 2015 - 07:03:11 EST


Prologue

The patch set introduces an extension to the default Linux CPU scheduler (CFS).
The main purpose of the extension is to utilize a big.LITTLE CPU for maximum
performance. Such a solution may be useful for users of the OdroidXU-3 board
(which has 8 cores) who do not care about power efficiency.

Maximum utilization was reached using the following policies:

1) A15 cores must be utilized as much as possible, i.e. idle A15 cores always
pull a task from an A7 core.

2) After a task has run on an A7 core for some period of time, it should be
swapped with an appropriate task from the A15 cluster in order to achieve
fairness.

3) The load of the big and little clusters is balanced according to frequency
and the A15/A7 slowdown coefficient.
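
As an illustration of policies 2 and 3, the sketch below scales each cluster's
load and compares the results. It is not code from the patch set: the
structure, the function names, the fixed-point scale, the tolerance parameter
and the scaling formula itself (raw load divided by current capacity) are
assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: 1024 represents a ratio of 1.0. */
#define HMP_SCALE 1024u

/* Illustrative per-cluster state, not the patch set's real structures. */
struct cluster_stat {
	uint64_t raw_load;   /* summed load of the cluster's runqueues */
	uint32_t cur_freq;   /* current frequency, kHz */
	uint32_t max_freq;   /* maximum frequency, kHz */
	uint32_t perf_ratio; /* busy-loop speed relative to A7, HMP_SCALE == 1.0 */
};

enum hmp_action { MOVE_A15_TO_A7, MOVE_A7_TO_A15, SWAP_TASKS };

/* Assumed scaling: divide raw load by the capacity the cluster has now. */
static uint64_t scaled_load(const struct cluster_stat *c)
{
	uint64_t capacity;

	assert(c->cur_freq > 0 && c->max_freq > 0 && c->perf_ratio > 0);
	capacity = (uint64_t)c->perf_ratio * c->cur_freq / c->max_freq;
	assert(capacity > 0);
	return c->raw_load * HMP_SCALE / capacity;
}

/* Migrate away from the more loaded cluster; swap when loads are close. */
static enum hmp_action pick_action(const struct cluster_stat *a15,
				   const struct cluster_stat *a7,
				   uint64_t tolerance)
{
	uint64_t big = scaled_load(a15);
	uint64_t little = scaled_load(a7);

	if (big > little + tolerance)
		return MOVE_A15_TO_A7;
	if (little > big + tolerance)
		return MOVE_A7_TO_A15;
	return SWAP_TASKS;
}
```

With these assumptions, an A15 cluster running at half of its maximum
frequency with a perf_ratio of 2048 has the same capacity as an A7 cluster at
full frequency, so equal raw loads lead to swapping rather than one-way
migration.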

Approach Description

The scheduler creates a hierarchy of two domains: MC and HMP. The MC domain is
the default domain for the SCHED_MC config. The HMP domain contains two
clusters: the A15 CPUs and the A7 CPUs. Balancing between HMP domains is
performed by the new logic; within MC domains, balancing is done by the default
logic of the 'load_balance()' function.

To perform balancing between HMP domains, the load of each cluster is calculated
in the scheduler's softirq handler. This value is then scaled according to each
cluster's frequency and the slowdown coefficient, which is the ratio of
busy-loop performance on A15 to that on A7. There are three kinds of migration
between the two clusters: from the A15 cluster to the A7 cluster (if the load on
the A15 cluster is too high), from the A7 cluster to the A15 cluster (in the
opposite case), and task swapping when the load on both clusters is the same.

To migrate a task from one cluster to another, the task must first be selected.
To find a task suitable for migration, the scheduler uses a special per-task
metric called 'druntime'. It is based on CFS's vruntime metric, but its
direction of growth depends on the core where the task is executed: on an A15
core it grows, and on an A7 core it decreases. So a druntime value close to zero
means that the task has been executed on both clusters for the same amount of
time. To get a task for migration, the scheduler scans each runqueue to find the
task with the highest or lowest druntime, depending on which cluster is scanned;
once the task is found, it is moved to the other cluster. These balancing steps
are performed in each scheduler balancing operation executed by softirq.
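
The druntime bookkeeping and the task selection described above could look
roughly like the sketch below. It is a simplification, not the patch set's
code: the types and function names are hypothetical, plain nanoseconds stand in
for the vruntime-based accounting, and the choice of the highest druntime on
the A15 cluster and the lowest on the A7 cluster is an assumption that follows
from the fairness goal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum core_type { CORE_A7, CORE_A15 };

/* Illustrative task descriptor, not the kernel's task_struct. */
struct task {
	int pid;
	int64_t druntime; /* > 0: more time spent on A15, < 0: more on A7 */
};

/* Account delta_ns of execution: grows on A15, decreases on A7. */
static void druntime_update(struct task *t, enum core_type core,
			    int64_t delta_ns)
{
	assert(delta_ns >= 0);
	if (core == CORE_A15)
		t->druntime += delta_ns;
	else
		t->druntime -= delta_ns;
}

/*
 * Scan the tasks of one cluster and pick a migration candidate.
 * Assumption: on A15 take the task with the highest druntime (it had the
 * most big-core time), on A7 take the one with the lowest druntime.
 * Returns NULL when there are no tasks.
 */
static struct task *pick_migration_candidate(struct task *tasks, size_t n,
					     enum core_type cluster)
{
	struct task *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!best) {
			best = &tasks[i];
			continue;
		}
		if (cluster == CORE_A15
			? tasks[i].druntime > best->druntime
			: tasks[i].druntime < best->druntime)
			best = &tasks[i];
	}
	return best;
}
```

A task that runs 300 ns on an A15 core and then 300 ns on an A7 core ends up
with a druntime of zero in this model, which matches the "same amount of time
on both clusters" reading above.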

To get maximum performance A15 cores must be fully utilized; this means that
idle A15 cores are always able to pull tasks from A7 cores while A7 cores cannot
do that from A15 cores.
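
The one-directional idle pull rule can be sketched as follows. The function
names are hypothetical, and choosing the A7 CPU with the most runnable tasks as
the source is an assumption; the text above only states that an idle A15 core
pulls from an A7 core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum core_type { CORE_A7, CORE_A15 };

/* Cross-cluster idle pull is allowed only from A7 to an idle A15 core. */
static bool idle_pull_allowed(enum core_type idle_core,
			      enum core_type src_core)
{
	return idle_core == CORE_A15 && src_core == CORE_A7;
}

/*
 * Pick the A7 CPU an idle A15 core pulls from.
 * Assumption: take the A7 CPU with the most runnable tasks, first one on
 * ties. Returns its index, or -1 if no A7 CPU has a runnable task.
 */
static int pick_a7_source(const unsigned int *a7_nr_running, size_t n_a7)
{
	unsigned int most = 0;
	int src = -1;
	size_t i;

	assert(a7_nr_running != NULL || n_a7 == 0);
	for (i = 0; i < n_a7; i++) {
		if (a7_nr_running[i] > most) {
			most = a7_nr_running[i];
			src = (int)i;
		}
	}
	return src;
}
```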

And finally, let's look at fairness. It is provided by swapping tasks during
every softirq balancing: when the balance is broken, the scheduler tries to
repair it by moving tasks from one cluster to another; then, once the clusters
are balanced, tasks are swapped during each softirq balancing. In addition to
this logic, 'select_task_rq_fair' was modified to place woken tasks on the least
loaded CPU, because this does not break the balance between A15 and A7 cores.
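
The wakeup placement can be pictured with the standalone sketch below. It does
not reproduce the modified 'select_task_rq_fair'; the function name, the plain
load array and the 64-bit affinity mask are assumptions made for illustration,
and the sketch compares raw per-CPU load without any frequency scaling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the least loaded CPU the task is allowed to run on,
 * or -1 if the mask allows none. Bit i of allowed_mask set means CPU i is
 * allowed. On equal load the lowest-numbered CPU wins.
 */
static int select_least_loaded_cpu(const uint64_t *cpu_load, size_t nr_cpus,
				   uint64_t allowed_mask)
{
	int best = -1;
	size_t cpu;

	assert(nr_cpus <= 64);
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!(allowed_mask & (UINT64_C(1) << cpu)))
			continue;
		if (best < 0 || cpu_load[cpu] < cpu_load[best])
			best = (int)cpu;
	}
	return best;
}
```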

Test results

Several test kits were used to measure the performance of the solution.
All comparisons are made against the Linaro MP scheduler.

The first test case is the PARSEC benchmark suite. It contains different types
of tasks, such as cluster searching or pattern recognition, in order to test
scheduler performance. Results of some benchmarks are listed below (in
seconds):

Streamcluster:

Developed by Princeton University, it solves the online clustering problem.
Streamcluster was included in the PARSEC benchmark suite because of the
importance of data mining algorithms and the prevalence of problems with
streaming characteristics.

Threads  HPERF_HMP  Linaro MP
      1    27.333     27.422
      2    14.162     14.197
      3    10.099     10.168
      4     8.227      8.332
      5    10.922     23.349
      6    10.85      22.507
      7    11.39      22.041
      8    12.307     21.181
      9    20.339     22.115
     10    21.33      23.746
     11    23.289     24.831
     12    25.363     26.699
     13    34.091     34.84
     14    34.758     38.661
     15    35.743     38.688
     16    38.1       44.735
     17    41.165     77.098
     18    44.223    102.633
     19    46.177    113.748
     20    48.22     119.146
     21    52.372    135.499
     22    54.319    136.454
     23    56.218    141.924
     24    57.843    145.727
     25    61.759    158.754
     26    63.179    163.915
     27    64.987    167.559
     28    67.329    171.203
     29    70.489    185.171
     30    73.084    189.303
     31    75.264    192.487
     32    77.015    197.27
    avg    40.373     87.543

Bodytrack:

This computer vision application is an Intel RMS workload which tracks a human
body with multiple cameras through an image sequence. This benchmark was
included due to the increasing significance of computer vision algorithms in
areas such as video surveillance, character animation and computer interfaces.

Threads  HPERF_HMP  Linaro MP
      1    15.884     16.632
      2     8.536      9.42
      3     6.037      7.257
      4     4.84       6.076
      5     8.835      5.739
      6     4.437      5.513
      7     4.119      5.474
      8     3.992      5.115
      9     3.854      5.164
     10     3.92       4.911
     11     3.854      4.932
     12     3.83       4.816
     13     3.839      5.643
     14     3.861      4.816
     15     3.889      4.896
     16     3.845      4.854
     17     3.872      4.837
     18     3.852      4.876
     19     4.304      4.868
     20     3.915      4.928
     21     3.87       4.841
     22     3.858      4.995
     23     3.881      4.97
     24     3.876      4.899
     25     3.854      4.96
     26     3.869      4.902
     27     3.874      4.979
     28     3.88       4.928
     29     3.914      5.008
     30     3.889      5.216
     31     3.898      5.242
     32     3.894      5.199
    avg     4.689      5.653

Blackscholes:

This application is an Intel RMS benchmark. It calculates the prices for a
portfolio of European options analytically with the Black-Scholes partial
differential equation. There is no closed-form expression for the blackscholes
equation and as such it must be computed numerically.

Threads  HPERF_HMP  Linaro MP
      1     7.293      6.807
      2     3.886      4.044
      3     2.906      2.911
      4     2.429      2.427
      5     2.58       2.985
      6     2.401      2.672
      7     2.205      2.411
      8     2.132      2.293
      9     2.074      2.41
     10     2.067      2.264
     11     2.054      2.205
     12     2.091      2.222
     13     2.042      2.28
     14     2.035      2.222
     15     2.026      2.25
     16     2.024      2.177
     17     2.021      2.173
     18     2.033      2.09
     19     2.03       2.05
     20     2.024      2.158
     21     2.002      2.175
     22     2.026      2.179
     23     2.017      2.134
     24     2.01       2.156
     25     2.009      2.155
     26     2.013      2.179
     27     2.017      2.177
     28     2.019      2.189
     29     2.013      2.158
     30     2.002      2.162
     31     2.016      2.16
     32     2.012      2.159
    avg     2.328      2.469

Also, the well-known Antutu benchmark was run on an Exynos 5433 board:

                           HPERF_HMP  Linaro MP
Integral benchmark result      42400      36860
Result: hperf_hmp is 15% better.
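
The 15% figure follows directly from the two Antutu scores, where a higher
score is better. A minimal check, using a hypothetical helper name:

```c
#include <assert.h>

/* Relative gain of new_score over base_score, in percent. */
static double percent_gain(double new_score, double base_score)
{
	assert(base_score > 0.0);
	return (new_score / base_score - 1.0) * 100.0;
}
```

Calling percent_gain(42400.0, 36860.0) gives a value slightly above 15.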


Arseniy Krasnov (13):
hperf_hmp: add new config for arm and arm64.
hperf_hmp: introduce new domain flag.
hperf_hmp: add sched domains initialization.
hperf_hmp: scheduler initialization routines.
hperf_hmp: introduce druntime metric.
hperf_hmp: is_hmp_imbalance introduced.
hperf_hmp: migration auxiliary functions.
hperf_hmp: swap tasks function.
hperf_hmp: one way balancing function.
hperf_hmp: idle pull function.
hperf_hmp: task CPU selection logic.
hperf_hmp: rest of logic.
hperf_hmp: cpufreq routines.

arch/arm/Kconfig | 21 +
arch/arm/kernel/topology.c | 6 +-
arch/arm64/Kconfig | 21 +
include/linux/sched.h | 17 +
kernel/sched/core.c | 65 +-
kernel/sched/fair.c | 1553 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 16 +
7 files changed, 1586 insertions(+), 113 deletions(-)

--
1.9.1
