[RFC PATCH 0/3] Scheduler Soft Affinity

From: subhra mazumdar
Date: Wed Jun 26 2019 - 18:53:10 EST


When multiple instances of workloads are consolidated on the same host it
is good practice to partition them for best performance, e.g. by giving a
NUMA node partition to each instance. Currently the Linux kernel provides
two interfaces for hard partitioning: the sched_setaffinity system call
and the cpuset.cpus cgroup. But these don't allow one instance to burst
out of its partition and use available CPUs from other partitions when
they are idle. Running all instances free range without any affinity, on
the other hand, suffers from cache coherence overhead across sockets
(NUMA nodes) when all instances are busy.

To achieve the best of both worlds, one potential way is to use the
AutoNUMA balancer, which migrates memory and threads to align them on the
same NUMA node when all instances are busy. But it doesn't work if memory
is spread across NUMA nodes, it has a high reaction time due to its
periodic scanning mechanism, and it can't handle sub-NUMA levels. Some
motivational experiments were done with 2 DB instances running on a
2-socket x86 Intel system with 22 cores per socket. 'numactl -m' was used
to bind the memory of each instance to one NUMA node; the idea was to
have AutoNUMA migrate only threads, as the memory is pinned. But AutoNUMA
ON vs OFF made no difference in performance. It was also found that
AutoNUMA still migrated pages across NUMA nodes, so numactl only controls
the initial allocation of memory and AutoNUMA is free to migrate it later.
Following are the vmstats for different numbers of users running TPC-C in
each DB instance with numactl and AutoNUMA ON. With AutoNUMA OFF the page
migrations were of course zero.

users (2x16)
numa_hint_faults 1672485
numa_hint_faults_local 1158283
numa_pages_migrated 373670

users (2x24)
numa_hint_faults 2267425
numa_hint_faults_local 1548501
numa_pages_migrated 586473

users (2x32)
numa_hint_faults 1916625
numa_hint_faults_local 1499772
numa_pages_migrated 229581

Given the above drawbacks, the most logical way to achieve the desired
behavior is via the task scheduler. A new interface is added, a new
system call sched_setaffinity2 in this case, to specify the set of soft
affinity CPUs. It takes an extra parameter to specify hard or soft
affinity, where hard behaves the same as the existing sched_setaffinity.
I am open to using other interfaces like cgroup or anything else I might
not have considered. Also, this patchset only allows it for CFS class
threads, as the latency of the preferential search may not be tolerable
for the RT class. But nothing in theory stops us from implementing soft
affinity for the RT class too. Finally, it also adds new scheduler
tunables to tune the "softness" of soft affinity, as different workloads
may have different optimal points. This is done using two tunables,
sched_allowed and sched_preferred: if the ratio of CPU utilization of the
preferred set to the allowed set crosses the ratio
sched_allowed:sched_preferred, the scheduler will use the entire allowed
set instead of the preferred set in the first level of search in
select_task_rq_fair. The default value is 100:1.
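
To make the proposed usage concrete, here is a minimal userspace sketch
of invoking the new system call. The syscall number, the flag names and
values, and the argument order below are assumptions for illustration
only; the authoritative definitions come with patch 1/3 (syscall_64.tbl
and include/uapi/linux/sched.h).

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Placeholder values; the real ones are defined by patch 1/3. */
#ifndef __NR_sched_setaffinity2
#define __NR_sched_setaffinity2 335
#endif
#define SCHED_HARD_AFFINITY 1   /* behaves like the existing sched_setaffinity */
#define SCHED_SOFT_AFFINITY 2   /* preferred, but not mandatory, CPUs */

int main(void)
{
        cpu_set_t mask;
        int cpu;

        /* Prefer the 22 CPUs of one socket of the test system. */
        CPU_ZERO(&mask);
        for (cpu = 0; cpu < 22; cpu++)
                CPU_SET(cpu, &mask);

        /* pid 0 means the calling task; the argument order is assumed. */
        if (syscall(__NR_sched_setaffinity2, 0, sizeof(mask), &mask,
                    SCHED_SOFT_AFFINITY) != 0)
                perror("sched_setaffinity2");

        return 0;
}

The spill decision driven by the tunables can be sketched as below. This
is illustrative only and not the actual patch code; util_preferred and
util_allowed stand for the CPU utilization of the preferred and allowed
sets as described above.

/*
 * Sketch: decide whether the first level of search in
 * select_task_rq_fair() should use the entire allowed set instead of
 * the preferred set.
 */
static inline int use_full_allowed_set(unsigned long long util_preferred,
                                       unsigned long long util_allowed,
                                       unsigned int sched_allowed,
                                       unsigned int sched_preferred)
{
        /*
         * util_preferred / util_allowed > sched_allowed / sched_preferred,
         * cross-multiplied to avoid division. With the default 100:1 the
         * preferred set is given up very reluctantly; with 5:4 the search
         * spills over much earlier ("softer" affinity).
         */
        return util_preferred * sched_preferred > util_allowed * sched_allowed;
}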

Following are the performance results of running 2 instances and 1
instance of Hackbench and Oracle DB on a 2-socket Intel x86 system with
22 cores per socket, with the default tunable settings. In the 2-instance
case the DB shows a substantial improvement over no affinity, while
Hackbench shows a negligible improvement. In the 1-instance case DB
performance is close to no affinity, but Hackbench has a significant
regression. Hard affinity numbers are also added for comparison. The load
in each Hackbench and DB instance is varied by varying the number of
groups and users respectively, so e.g. 2*16 below means 2 instances with
16 groups (Hackbench) or 16 users (DB) each. %gain is w.r.t. no affinity.

(100:1)
Hackbench   %gain with soft affinity   %gain with hard affinity
2*4         1.12                        1.3
2*8         1.67                        1.35
2*16        1.3                         1.12
2*32        0.31                        0.61
1*4         -18                         -58
1*8         -24                         -59
1*16        -33                         -72
1*32        -38                         -83

DB          %gain with soft affinity   %gain with hard affinity
2*16        4.26                        4.1
2*24        5.43                        7.17
2*32        6.31                        6.22
1*16        -0.02                       0.62
1*24        0.12                        -26.1
1*32        -1.48                       -8.65

The experiments were repeated with sched_allowed:sched_preferred set to
5:4 to get a "softer" soft affinity. The following numbers show that this
preserves the (negligible) improvement for the 2-instance Hackbench case
but reduces the 1-instance regression significantly. For the DB this
setting doesn't work as well, since the improvements of the 2-instance
case go away. This again shows that different workloads have different
optimal settings. (A short sketch of how the tunables can be set follows
the tables below.)

(5:4)
Hackbench   %gain with soft affinity
2*4         1.43
2*8         1.36
2*16        1.01
2*32        1.45
1*4         -2.55
1*8         -5.06
1*16        -8
1*32        -7.32

DB          %gain with soft affinity
2*16        0.46
2*24        3.68
2*32        -3.34
1*16        0.08
1*24        1.6
1*32        -1.29
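
For reference, the 5:4 setting above can be applied by writing the two
tunables. A small sketch follows, assuming they are exposed as
/proc/sys/kernel/sched_allowed and /proc/sys/kernel/sched_preferred; the
exact names and paths come from patch 3/3, so treat these as assumptions.

#include <stdio.h>

/* Write a single value to a procfs tunable; returns 0 on success. */
static int write_tunable(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* "Softer" soft affinity: spill out of the preferred set earlier. */
        if (write_tunable("/proc/sys/kernel/sched_allowed", "5") ||
            write_tunable("/proc/sys/kernel/sched_preferred", "4")) {
                perror("soft affinity tunables");
                return 1;
        }
        return 0;
}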

Finally, I measured the overhead of soft affinity when it is NOT used by
comparing against the baseline kernel for the no affinity and hard
affinity cases with Hackbench. The following is the improvement of the
soft affinity kernel w.r.t. the baseline, but the numbers are really
within the noise margin. This shows soft affinity has no overhead when it
is not used.

Hackbench   %diff of no affinity   %diff of hard affinity
2*4         0.11                   0.31
2*8         0.13                   0.55
2*16        0.61                   0.90
2*32        0.86                   1.01
1*4         0.48                   0.43
1*8         0.45                   0.33
1*16        0.61                   0.64
1*32        0.11                   0.63

A final set of experiments was done (numbers not shown) with the memory
of each DB instance spread evenly across both NUMA nodes. This showed
similar improvements with soft affinity for the 2-instance case; since
memory locality cannot explain the gain there, the improvement comes from
saving LLC coherence overhead across sockets.

subhra mazumdar (3):
sched: Introduce new interface for scheduler soft affinity
sched: change scheduler to give preference to soft affinity CPUs
sched: introduce tunables to control soft affinity

arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/sched.h | 5 +-
include/linux/sched/sysctl.h | 2 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 3 +
init/init_task.c | 2 +
kernel/compat.c | 2 +-
kernel/rcu/tree_plugin.h | 3 +-
kernel/sched/core.c | 167 ++++++++++++++++++++++++++++-----
kernel/sched/fair.c | 154 ++++++++++++++++++++++--------
kernel/sched/sched.h | 2 +
kernel/sysctl.c | 14 +++
13 files changed, 297 insertions(+), 65 deletions(-)

--
2.9.3