Re: [PATCH] sched/numa: scan the vma if it has not been scanned for a while

From: Chen Yu
Date: Mon Jul 15 2024 - 08:54:26 EST


On 2024-06-30 at 23:00:32 +0800, Yujie Liu wrote:
> Problem statement:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> NUMA vma scan overhead has been reduced a lot. However, this can be a
> double-edged sword: scanning fewer vmas can generate less NUMA page
> fault information, and insufficient information makes it harder for
> the NUMA balancer to make decisions. Later,
> commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
> regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
> mm numa_scan_seq based unconditional scan") are found to bring back part
> of the performance.
>
> Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
> a long period of remote NUMA node reads was observed via PMU events:
> a few cores had ~500MB/s of remote memory access for ~20 seconds.
> This causes high core-to-core variance and a performance penalty.
> Investigation showed that many vmas are skipped due to the active
> PID check. According to the trace events, in most cases vma_is_accessed()
> returns false because the access history stored in the pids_active
> array has been cleared.
>
> Proposal:
> The main idea is to adjust vma_is_accessed() so that it returns true
> more easily.
>
> Solution 1 is to extend pids_active[] from 2 to N, which was proposed
> by Raghavendra[1]. How to choose N is still under investigation.
>
> Solution 2 is to compare the difference between mm->numa_scan_seq and
> vma->numab_state->prev_scan_seq. If the difference exceeds a threshold,
> scan the vma.
>
> Solution 2 especially helps cases with a limited number of shared
> VMAs, e.g. the process-based SPECcpu. Without solution 2, if a single
> process accesses the vma at the beginning and then sleeps for a long
> time (so the pids_active array has been cleared), then when the process
> is woken up it never gets a chance to set prot_none again, because
> only the first 2 scans are unconditionally treated as accessed:
> (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
> and no other threads within the task can help set prot_none.
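
For anyone following the thread, the decision described above can be
sketched in plain userspace C as below. This is only an illustration under
stated assumptions, not the patch itself: the struct, the function name
vma_should_scan_sketch() and the value of SCAN_SEQ_THRESHOLD_SKETCH are
made up for the example (the message above does not state the threshold),
and the real vma_is_accessed() checks more conditions than these three.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: the quoted message does not give the real value. */
#define SCAN_SEQ_THRESHOLD_SKETCH 4u

/* Simplified stand-in for the per-vma NUMA balancing state (assumption). */
struct vma_scan_state_sketch {
	unsigned int start_scan_seq;   /* mm scan sequence when the vma was first seen */
	unsigned int prev_scan_seq;    /* mm scan sequence when the vma was last scanned */
	unsigned long pids_active[2];  /* two windows of recently faulting PID bits */
};

/*
 * Return true if the vma should be scanned by the caller.
 * pid_bit is the (already hashed) bit representing the current task.
 */
static bool vma_should_scan_sketch(const struct vma_scan_state_sketch *st,
				   unsigned int mm_scan_seq,
				   unsigned long pid_bit)
{
	assert(st != NULL);

	/* The first two scan passes are always treated as accessed. */
	if (mm_scan_seq - st->start_scan_seq < 2)
		return true;

	/* Active PID check: did this task fault on the vma recently? */
	if ((st->pids_active[0] | st->pids_active[1]) & pid_bit)
		return true;

	/* Solution 2: scan anyway if the vma has not been scanned for a while. */
	if (mm_scan_seq - st->prev_scan_seq > SCAN_SEQ_THRESHOLD_SKETCH)
		return true;

	return false;
}
```

With pids_active cleared and prev_scan_seq = 2, the sketch skips the vma
at mm_scan_seq = 3 but scans it again at mm_scan_seq = 7, which is the
"sleeps for a long time, then wakes up" case described above.
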
>
> Raghavendra helped test the current patch and got positive results
> on an AMD platform:
>
> autonumabench NUMA01
>                         base              patched
> Amean  syst-NUMA01    194.05 (  0.00%)     165.11 * 14.92%*
> Amean  elsp-NUMA01    324.86 (  0.00%)     315.58 *  2.86%*
>
> Duration User      380345.36    368252.04
> Duration System      1358.89      1156.23
> Duration Elapsed     2277.45      2213.25
>
> autonumabench NUMA02
>
> Amean  syst-NUMA02      1.12 (  0.00%)       1.09 *  2.93%*
> Amean  elsp-NUMA02      3.50 (  0.00%)       3.56 * -1.84%*
>
> Duration User        1513.23      1575.48
> Duration System         8.33         8.13
> Duration Elapsed       28.59        29.71
>
> kernbench
>
> Amean  user-256     22935.42 (  0.00%)   22535.19 *  1.75%*
> Amean  syst-256      7284.16 (  0.00%)    7608.72 * -4.46%*
> Amean  elsp-256       159.01 (  0.00%)     158.17 *  0.53%*
>
> Duration User       68816.41     67615.74
> Duration System     21873.94     22848.08
> Duration Elapsed      506.66       504.55
>
>
> Intel 256 CPUs/2 Sockets:
> autonuma benchmark also shows some improvements:
>
>                                v6.10-rc5            v6.10-rc5
>                                                        +patch
> Amean  syst-NUMA01                245.85 (  0.00%)     230.84 *  6.11%*
> Amean  syst-NUMA01_THREADLOCAL    205.27 (  0.00%)     191.86 *  6.53%*
> Amean  syst-NUMA02                 18.57 (  0.00%)      18.09 *  2.58%*
> Amean  syst-NUMA02_SMT              2.63 (  0.00%)       2.54 *  3.47%*
> Amean  elsp-NUMA01                517.17 (  0.00%)     526.34 * -1.77%*
> Amean  elsp-NUMA01_THREADLOCAL     99.92 (  0.00%)     100.59 * -0.67%*
> Amean  elsp-NUMA02                 15.81 (  0.00%)      15.72 *  0.59%*
> Amean  elsp-NUMA02_SMT             13.23 (  0.00%)      12.89 *  2.53%*
>
>                    v6.10-rc5    v6.10-rc5
>                                    +patch
> Duration User     1064010.16   1075416.23
> Duration System      3307.64      3104.66
> Duration Elapsed     4537.54      4604.73
>
> Link: https://lore.kernel.org/lkml/88d16815ef4cc2b6c08b4bb713b25421b5589bc7.1710829750.git.raghavendra.kt@xxxxxxx/ #1
> Reported-by: Xiaoping Zhou <xiaoping.zhou@xxxxxxxxx>
> Co-developed-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> Signed-off-by: Yujie Liu <yujie.liu@xxxxxxxxx>
> Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
> ---

Hi Peter, Mel,

May I know if this patch is in the right direction? It fixes
a SPECcpu performance regression found recently.

thanks,
Chenyu