Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes

From: Ingo Molnar
Date: Mon Sep 10 2018 - 04:48:16 EST



* Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> wrote:

> Currently the task scan rate is reset when the NUMA balancer migrates the
> task to a different node. If the NUMA balancer initiates a swap, the reset
> applies only to the task that initiates the swap. Similarly, no scan rate
> reset is done if the task is migrated across nodes by the traditional load
> balancer.
>
> Instead, move the scan reset to migrate_task_rq. This ensures that a task
> moved out of its preferred node either gets back to its preferred node
> quickly or finds a new preferred node. Doing so is fair to all tasks
> migrating across nodes.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     210118  208862   -0.597759
> 1     313171  307007   -1.96825
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS  Prev     Current  %Change
> 8     91027.5  89911.4  -1.22611
> 1     216460   216176   -0.131202
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     191918  196078   2.16759
> 1     207043  214664   3.68088
>
>
> on 4 Socket/4 Node Power7
> JVMS  Prev     Current  %Change
> 8     58462.1  60719.2  3.86079
> 1     108334   112615   3.95167
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count  Min      Max      Avg      Variance  %Change
> 5      11851.8  11937.3  11890.9  33.5169
> 5      12511.7  12559.4  12539.5  15.5883   5.45459
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4791     5016.08  4962.55  85.9625
> 5      4709.28  4979.28  4919.32  105.126   -0.871125
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count  Min      Max      Avg     Variance  %Change
> 5      9353.43  9380.49  9369.6  9.04361
> 5      9388.38  9406.29  9395.1  5.98959   0.272157
>
>
> on 4 Socket/4 Node Power7
> count  Min      Max      Avg      Variance  %Change
> 5      149.518  215.412  179.083  21.5903
> 5      157.71   184.929  174.754  10.7275   -2.41731
>
> Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5936ed..4ea0eff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
>  	if (env.best_cpu == -1)
>  		return -EAGAIN;
>
> -	/*
> -	 * Reset the scan period if the task is being rescheduled on an
> -	 * alternative node to recheck if the tasks is now properly placed.
> -	 */
> -	p->numa_scan_period = task_scan_start(p);
> -
>  	best_rq = cpu_rq(env.best_cpu);
>  	if (env.best_task == NULL) {
>  		ret = migrate_task_to(p, env.best_cpu);
> @@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
>
>  	/* We have migrated, no longer consider this task hot */
>  	p->se.exec_start = 0;
> +
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!p->mm || (p->flags & PF_EXITING))
> +		return;
> +
> +	if (p->numa_faults) {
> +		int src_nid = cpu_to_node(task_cpu(p));
> +		int dst_nid = cpu_to_node(new_cpu);
> +
> +		if (src_nid != dst_nid)
> +			p->numa_scan_period = task_scan_start(p);
> +	}
> +#endif

Please don't add #ifdeffery inside functions, especially not if they do weird flow control like
a 'return' from the middle of a block.

A properly named inline helper would work, I suppose.
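
For illustration, the helper approach might look roughly like the sketch
below. It is a user-space mock, not a tested kernel patch: the helper name
update_scan_period(), the cut-down struct task_struct, the two-CPUs-per-node
cpu_to_node(), and the fixed return value of task_scan_start() are all
assumptions made so that the example compiles on its own. The point is only
the shape: the #ifdef lives outside the function, and the early 'return'
exits the helper rather than the middle of migrate_task_rq_fair().

```c
#include <stddef.h>

/* Stand-ins for kernel definitions so the sketch builds in user space. */
#define CONFIG_NUMA_BALANCING 1
#define PF_EXITING 0x00000004	/* stand-in flag value for this mock */

struct task_struct {		/* heavily cut-down mock of the real struct */
	void *mm;
	unsigned int flags;
	unsigned long *numa_faults;
	unsigned int numa_scan_period;
	int cpu;
};

/* Mock topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1, and so on. */
static int cpu_to_node(int cpu) { return cpu / 2; }
static int task_cpu(const struct task_struct *p) { return p->cpu; }

/* Mock: the real function derives the value from the task's state. */
static unsigned int task_scan_start(struct task_struct *p)
{
	(void)p;
	return 1000;
}

#ifdef CONFIG_NUMA_BALANCING
/* Hypothetical helper name: reset the scan period on a cross-node move. */
static inline void update_scan_period(struct task_struct *p, int new_cpu)
{
	int src_nid, dst_nid;

	/* Kernel threads, exiting tasks and tasks without fault stats: skip. */
	if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING))
		return;

	src_nid = cpu_to_node(task_cpu(p));
	dst_nid = cpu_to_node(new_cpu);

	if (src_nid != dst_nid)
		p->numa_scan_period = task_scan_start(p);
}
#else
/* Empty stub keeps the caller free of preprocessor conditionals. */
static inline void update_scan_period(struct task_struct *p, int new_cpu)
{
}
#endif

static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
{
	/* ... existing migration bookkeeping elided in this sketch ... */
	update_scan_period(p, new_cpu);
}
```

With the stub in the #else branch, the call site reads the same whether or
not CONFIG_NUMA_BALANCING is set, and the compiler can drop the empty call.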

Thanks,

Ingo