[PATCH 1/4] sched: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache

From: Mel Gorman
Date: Mon Dec 18 2017 - 04:44:39 EST


If a wakeup is issued from an idle CPU due to an interrupt then it's
possible that the task being woken will be pulled to run on the current CPU.
Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.

This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed. From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use but let's only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.
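
As an illustration only, the new condition can be modelled in userspace
as below. This is a minimal sketch, not kernel code: the toy_* names, the
NR_CPUS value and the llc_id field are all hypothetical stand-ins for
idle_cpu() and cpus_share_cache(), and only the check changed by this
patch is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical machine size for this sketch */

/* Hypothetical toy state, not the kernel's data structures. */
struct toy_cpu {
	int llc_id;	/* id of the last-level cache this CPU sits under */
	int nr_running;	/* number of runnable tasks on this CPU */
};

static struct toy_cpu toy_cpus[NR_CPUS];

/* Model a 2-socket machine: CPUs 0-3 share one LLC, CPUs 4-7 another. */
static void toy_init_two_sockets(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		toy_cpus[cpu].llc_id = cpu / (NR_CPUS / 2);
		toy_cpus[cpu].nr_running = 0;
	}
}

/* Stand-in for idle_cpu(): a CPU with nothing runnable is idle. */
static bool toy_idle_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return toy_cpus[cpu].nr_running == 0;
}

/* Stand-in for cpus_share_cache(): same LLC id means shared cache. */
static bool toy_cpus_share_cache(int a, int b)
{
	assert(a >= 0 && a < NR_CPUS && b >= 0 && b < NR_CPUS);
	return toy_cpus[a].llc_id == toy_cpus[b].llc_id;
}

/*
 * The check introduced by this patch: an idle this_cpu only pulls the
 * task immediately when it shares cache with prev_cpu.
 */
static bool toy_idle_pull_allowed(int this_cpu, int prev_cpu)
{
	return toy_idle_cpu(this_cpu) &&
	       toy_cpus_share_cache(this_cpu, prev_cpu);
}
```

In this model, an interrupt landing on idle CPU 0 may pull a task that
last ran on CPU 1 (same LLC) but not one that last ran on CPU 5 (other
socket), which is then left to the remaining wake_affine checks.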

These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.

                    4.15.0-rc3         4.15.0-rc3
                       vanilla        lessmigrate
Hmean  1       854.64 ( 0.00%)    865.01 ( 1.21%)
Hmean  2      1229.60 ( 0.00%)   1274.44 ( 3.65%)
Hmean  4      1591.81 ( 0.00%)   1628.08 ( 2.28%)
Hmean  8      1845.04 ( 0.00%)   1831.80 (-0.72%)
Hmean  16     2038.61 ( 0.00%)   2091.44 ( 2.59%)
Hmean  32     2327.19 ( 0.00%)   2430.29 ( 4.43%)
Hmean  64     2570.61 ( 0.00%)   2568.54 (-0.08%)
Hmean  128    2481.89 ( 0.00%)   2499.28 ( 0.70%)
Stddev 1        14.31 ( 0.00%)      5.35 (62.65%)
Stddev 2        21.29 ( 0.00%)     11.09 (47.92%)
Stddev 4         7.22 ( 0.00%)      6.80 ( 5.92%)
Stddev 8        26.70 ( 0.00%)      9.41 (64.76%)
Stddev 16       22.40 ( 0.00%)     20.01 (10.70%)
Stddev 32       45.13 ( 0.00%)     44.74 ( 0.85%)
Stddev 64       93.10 ( 0.00%)     93.18 (-0.09%)
Stddev 128     184.28 ( 0.00%)    177.85 ( 3.49%)

Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is:

                    4.15.0-rc3         4.15.0-rc3
                       vanilla        lessmigrate
Amean  1        21.71 ( 0.00%)     21.47 ( 1.08%)
Amean  2        30.89 ( 0.00%)     29.58 ( 4.26%)
Amean  4        47.54 ( 0.00%)     46.61 ( 1.97%)
Amean  8        82.71 ( 0.00%)     82.81 (-0.12%)
Amean  16      149.45 ( 0.00%)    145.01 ( 2.97%)
Amean  32      265.49 ( 0.00%)    248.43 ( 6.42%)
Amean  64      463.23 ( 0.00%)    463.55 (-0.07%)
Amean  128     933.97 ( 0.00%)    935.50 (-0.16%)
Stddev 1         1.58 ( 0.00%)      1.54 ( 2.26%)
Stddev 2         2.84 ( 0.00%)      2.95 (-4.15%)
Stddev 4         6.78 ( 0.00%)      6.85 (-0.99%)
Stddev 8        16.85 ( 0.00%)     16.37 ( 2.85%)
Stddev 16       41.59 ( 0.00%)     41.04 ( 1.32%)
Stddev 32      111.05 ( 0.00%)    105.11 ( 5.35%)
Stddev 64      285.94 ( 0.00%)    288.01 (-0.72%)
Stddev 128     803.39 ( 0.00%)    809.73 (-0.79%)
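
For reading the bracketed percentages in the two tables: they appear to
be relative changes against the vanilla column, signed so that a positive
value is an improvement (higher throughput, lower latency, lower standard
deviation). That convention is inferred from the figures, not stated
anywhere in the report, and the helper names below are hypothetical. The
tables seem to be computed from unrounded samples, so recomputing from
the printed values only matches to within rounding.

```c
#include <assert.h>

/* Relative gain in percent when a higher value is better (throughput). */
static double gain_higher_is_better(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base * 100.0;
}

/* Relative gain in percent when a lower value is better (latency, stddev). */
static double gain_lower_is_better(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) / base * 100.0;
}
```

For example, gain_higher_is_better(854.64, 865.01) is roughly 1.21,
in line with the Hmean 1 row, and gain_lower_is_better(21.71, 21.47) is
roughly 1.1, close to the 1.08% printed for Amean 1.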

It's a small improvement, which is not surprising given that migrations
to a different node are not that common. However, it is noticeable
in the CPU migration statistics, which are reduced by 24%.

Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fe3aa853e4d..4a1f7d32ecf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5701,7 +5701,13 @@ static bool
 wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
 		 int this_cpu, int prev_cpu, int sync)
 {
-	if (idle_cpu(this_cpu))
+	/*
+	 * If this_cpu is idle, it implies the wakeup is from interrupt
+	 * context. Only allow the move if cache is shared. Otherwise an
+	 * interrupt intensive workload could force all tasks onto one
+	 * node depending on the IO topology or IRQ affinity settings.
+	 */
+	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
 		return true;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
--
2.15.0