[PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue

From: Rohit Jain
Date: Sat Oct 07 2017 - 19:45:43 EST

* Changed the dynamic threshold calculation, as having global state
can be avoided.

* Split up the patch for find_idlest_cpu and select_idle_sibling code

* Rebased it to peterz's tree (apologies for wrong tree for v3)

* Changed the threshold to 768 from 819 for easier shifts
* Changed the find_idlest_cpu code path to be simpler
* Changed the select_idle_core code path to search for
idlest+full_capacity core
* Added scaled capacity awareness to wake_affine_idle code path

During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for the select_idle_sibling code path to
return an idle CPU. However, the rest of the select_idle_* code paths
remain capacity agnostic. Further, these code paths are only aware of
the original capacity and not of the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in the scheduler (CAS), which
avoids CPUs that might have their capacities reduced (due to IRQ/RT
activity) when trying to schedule threads (on the push side) in the
system. This awareness has been added to the fair scheduling class.

It does so using the following algorithm:
1) The scaled capacities are already calculated, as in rt_avg.

2) Any CPU which is running below the capacity threshold (768/1024,
i.e. 75% of its original capacity) is considered running low on
capacity.

3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.

4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
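
The steps above can be illustrated with a small userspace sketch. This is
not the code from the patches: struct cpu_state, has_full_capacity() and
pick_idle_cpu() are hypothetical names made up for illustration, and the
fallback policy (first idle low-capacity CPU) is an assumption. Only the
768/1024 threshold comes from the changelog above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY_SHIFT		10	/* 1 << 10 == 1024, full scale */
#define CAPACITY_THRESHOLD	768	/* threshold named in the changelog */

/* Hypothetical per-CPU snapshot; not the kernel's struct rq. */
struct cpu_state {
	unsigned long capacity_orig;	/* capacity before IRQ/RT time is removed */
	unsigned long capacity;		/* capacity left after IRQ/RT activity */
	bool idle;
};

/* Step 2: a CPU below the threshold counts as low on capacity. */
static bool has_full_capacity(const struct cpu_state *cpu)
{
	unsigned long min_cap;

	min_cap = (cpu->capacity_orig * CAPACITY_THRESHOLD) >> CAPACITY_SHIFT;
	return cpu->capacity >= min_cap;
}

/*
 * Steps 3 and 4: prefer an idle CPU with full capacity; if none exists,
 * fall back to the first idle low-capacity CPU (assumed policy).
 * Returns the CPU index, or -1 if no CPU is idle.
 */
static int pick_idle_cpu(const struct cpu_state *cpus, int nr_cpus)
{
	int fallback = -1;
	int i;

	assert(nr_cpus >= 0);
	assert(cpus != NULL || nr_cpus == 0);

	for (i = 0; i < nr_cpus; i++) {
		if (!cpus[i].idle)
			continue;
		if (has_full_capacity(&cpus[i]))
			return i;
		if (fallback < 0)
			fallback = i;
	}
	return fallback;
}
```

With three CPUs where CPU 0 is idle but left with 500 of 1024, CPU 1 is
busy, and CPU 2 is idle with 900 of 1024, pick_idle_cpu() skips CPU 0 and
returns 2. If CPU 2 becomes busy, CPU 0 is returned as the best available.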

The performance numbers:
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database
workload.

For microbenchmark results, I ran hackbench with processes while also
running ping on CPUs 0, 1 and 2 as:
'ping -l 10000 -q -s 10 -f hostX'

The results below should be read as:

* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.

* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping

* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline

Following are the runtimes with hackbench and ping activity as
described above (lower is better), on a 44-core, 2-socket x86 machine:

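+---------------+-------+--------+--------+
|Num.           |CAS    |Baseline|Baseline|
|Tasks          |with   |with    |without |
|(groups of 40) |ping   |ping    |ping    |
+---------------+-------+--------+--------+
|               |Mean   |Mean    |Mean    |
+---------------+-------+--------+--------+
|1              | 0.55  | 0.59   | 0.53   |
|2              | 0.66  | 0.81   | 0.51   |
|4              | 0.99  | 1.16   | 0.95   |
|8              | 1.92  | 1.93   | 1.88   |
|16             | 3.24  | 3.26   | 3.15   |
|32             | 5.93  | 5.98   | 5.68   |
|64             | 11.55 | 11.94  | 10.89  |
+---------------+-------+--------+--------+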
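
To read the table as relative gains, the improvement of CAS over the
baseline with ping can be derived from the means. A minimal sketch;
improvement_pct() is a hypothetical helper written for this illustration
and is not part of the patches.

```c
#include <assert.h>

/*
 * Percentage by which the 'cas' runtime improves on the 'baseline'
 * runtime. Lower runtime is better, so a positive result means CAS
 * was faster.
 */
static double improvement_pct(double baseline, double cas)
{
	assert(baseline > 0.0);
	return (baseline - cas) / baseline * 100.0;
}
```

For example, improvement_pct(0.81, 0.66) gives about 18.5% for 2 groups,
while improvement_pct(1.93, 1.92) gives about 0.5% for 8 groups.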

Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
    code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling
    code path
  sched/fair: Introduce scaled capacity awareness in wake_affine_idle
    code path

 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 13 deletions(-)