Re: [PATCH 2/4] sched: Allow a wakee to run on the prev_cpu if it is idle and cache-affine with the waker
From: Peter Zijlstra
Date: Mon Dec 18 2017 - 05:26:24 EST
On Mon, Dec 18, 2017 at 09:43:25AM +0000, Mel Gorman wrote:
> With the commit "sched: Only migrate tasks due to interrupts if prev
> and target CPUs share cache", we no longer migrate a task from interrupt
> context if the waker does not share a CPU. However, for a normal wakeup
> from a cache-affine process, we can miss the fact that prev_cpu is idle
> and an appropriate sibling leading to unnecessary searches and migrations.
>
> This patch reworks wake_affine to return a suitable CPU to wake on which
> may be the current or prev CPU. If wake_affine_idle returns prev due to it
> being idle then select_idle_sibling will immediately return the prev_cpu
> without searching. It's slightly mixed on dbench using ext4 with gains when the machine is lightly
> loaded and a small regression borderline on the noise when more than a node's
> worth of CPU is used.
>
> 4.15.0-rc3 4.15.0-rc3
> noirq wakeprev
> Hmean 1 865.01 ( 0.00%) 834.19 ( -3.56%)
> Hmean 2 1274.44 ( 0.00%) 1353.09 ( 6.17%)
> Hmean 4 1628.08 ( 0.00%) 1714.82 ( 5.33%)
> Hmean 8 1831.80 ( 0.00%) 1855.84 ( 1.31%)
> Hmean 16 2091.44 ( 0.00%) 1975.40 ( -5.55%)
> Hmean 32 2430.29 ( 0.00%) 2298.58 ( -5.42%)
> Hmean 64 2568.54 ( 0.00%) 2536.56 ( -1.25%)
> Hmean 128 2499.28 ( 0.00%) 2543.81 ( 1.78%)
> Stddev 1 5.35 ( 0.00%) 19.39 (-262.63%)
> Stddev 2 11.09 ( 0.00%) 4.88 ( 55.97%)
> Stddev 4 6.80 ( 0.00%) 9.24 ( -35.93%)
> Stddev 8 9.41 ( 0.00%) 28.39 (-201.82%)
> Stddev 16 20.01 ( 0.00%) 44.92 (-124.56%)
> Stddev 32 44.74 ( 0.00%) 50.14 ( -12.07%)
> Stddev 64 93.18 ( 0.00%) 84.97 ( 8.81%)
> Stddev 128 177.85 ( 0.00%) 178.00 ( -0.09%)
>
> However, system CPU usage is noticably reduced
>
> 4.15.0-rc3 4.15.0-rc3
> noirq wakeprev
> User 1058.32 1077.42
> System 5729.22 5287.61
> Elapsed 1550.69 1553.09
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 51 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4a1f7d32ecf6..392e08b364bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5689,17 +5689,21 @@ static int wake_wide(struct task_struct *p)
> * soonest. For the purpose of speed we only consider the waking and previous
> * CPU.
> *
> - * wake_affine_idle() - only considers 'now', it check if the waking CPU is (or
> - * will be) idle.
> + * wake_affine_idle() - only considers 'now', it checks if a CPU that is
> + * cache-affine with the waker is idle
> + *
This bit belongs in the previous patch I'm thinking.
> + * wake_affine_sync() - only considers 'now', it checks if the waking CPU
> + * will be idle. Migrations to a different NUMA node
> + * are allowed on the basis that sync wakeups imply
> + * shared data between waker and wakee.
And it would be nice if we can rework the return value thing in a
separate patch from adding that affine_sync thing, and then munge your
3rd patch along wiht the patch that introduces it.
Did you run these patches on more than just dbench? In specific I
suppose the schbench stuff from facebook would be interesting. Also that
NAS-lu benchmark.