Re: [RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression

From: Bharata B Rao
Date: Fri May 19 2023 - 03:57:30 EST


On 16-May-23 2:49 PM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only the threads which had previously
> accessed a vma are allowed to scan it.
>
> While this significantly reduced system time overhead, there were corner
> cases which genuinely need some relaxation. For example:
>
> 1) Concern raised by PeterZ, where if there are N partition sets of vmas
> belonging to tasks, then unfairness in allowing these threads to scan could
> potentially amplify the side effect of some of the vmas being left
> unscanned.
>
> 2) The LKP numa01 benchmark regression reported below.
>
> Currently this is handled by allowing the first two scans unconditionally,
> as indicated by mm->numa_scan_seq. This is imprecise since for some
> benchmarks vma scanning might itself start at numa_scan_seq > 2.
>
> Solution:
> Allow unconditional scanning of vmas of tasks depending on vma size. This
> is achieved by maintaining a per vma scan counter, where
>
> f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)
>
> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> regression.
>
> Result:
> numa01_THREAD_ALLOC result on 6.4.0-rc1 (that has w/ numascan enhancement)
>                         base-numascan   base            base+fix
> real                    1m3.025s        1m24.163s       1m3.551s
> user                    213m44.232s     251m3.638s      219m55.662s
> sys                     6m26.598s       0m13.056s       2m35.767s
>
> numa_hit                5478165         4395752         4907431
> numa_local              5478103         4395366         4907044
> numa_other              62              386             387
> numa_pte_updates        1989274         11606           1265014
> numa_hint_faults        1756059         515             1135804
> numa_hint_faults_local  971500          486             558076
> numa_pages_migrated     784211          29              577728
>
> Summary: Regression in base is recovered by allowing scanning as required.
>
> [1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@xxxxxxx/T/#t
>
> Reported-by: Aithal Srikanth <sraithal@xxxxxxx>
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@xxxxxxx/T/
> Signed-off-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
> ---
> include/linux/mm_types.h | 1 +
> kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++--------
> 2 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..992e460a713e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
>  	unsigned long next_scan;
>  	unsigned long next_pid_reset;
>  	unsigned long access_pids[2];
> +	unsigned int scan_counter;
>  };
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..2c3e17e7fc2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  static bool vma_is_accessed(struct vm_area_struct *vma)
>  {
>  	unsigned long pids;
> +	unsigned int vma_size;
> +	unsigned int scan_threshold;
> +	unsigned int scan_size;
> +
> +	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> +
> +	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
> +		return true;
> +
> +	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
> +	/* vma size in MB */
> +	vma_size = (vma->vm_end - vma->vm_start) >> 20;
> +
> +	/* Total scans needed to cover VMA */
> +	scan_threshold = (vma_size / scan_size);
> +
>  	/*
> -	 * Allow unconditional access first two times, so that all the (pages)
> -	 * of VMAs get prot_none fault introduced irrespective of accesses.
> +	 * Allow the scanning of half of disjoint set's VMA to induce
> +	 * prot_none fault irrespective of accesses.
>  	 * This is also done to avoid any side effect of task scanning
>  	 * amplifying the unfairness of disjoint set of VMAs' access.
>  	 */
> -	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> -		return true;
> -
> -	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> -	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
> +	scan_threshold = 1 + (scan_threshold >> 1);
> +	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
>  }
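
For illustration, the threshold arithmetic in vma_is_accessed() above can be
modelled in userspace. This is only a sketch: scan_threshold_for() and
allowed_to_scan() are hypothetical names, and the 256 MB scan size used in the
note below is an assumed value for sysctl_numa_balancing_scan_size, not
something stated in the patch.

```c
#include <stdbool.h>

/*
 * Userspace model of the threshold computed in the quoted hunk.
 * Function names are hypothetical; sizes are in MB, as in the patch.
 */
static unsigned int scan_threshold_for(unsigned long vm_start,
				       unsigned long vm_end,
				       unsigned int scan_size_mb)
{
	/* vma size in MB */
	unsigned int vma_size = (unsigned int)((vm_end - vm_start) >> 20);
	/* Total scans needed to cover the VMA */
	unsigned int scan_threshold = vma_size / scan_size_mb;

	/* Half of those scans, plus one */
	return 1 + (scan_threshold >> 1);
}

/* Mirrors the final comparison for a task not found in access_pids. */
static bool allowed_to_scan(unsigned int scan_counter,
			    unsigned int scan_threshold)
{
	return scan_counter <= scan_threshold;
}
```

With the assumed 256 MB scan size, a 1 GB VMA gives 1 + (4 >> 1) = 3, so a
task outside access_pids may still scan while scan_counter is 0 through 3.
Any VMA smaller than the scan size ends up with a threshold of 1.
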
>
> -#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
> +#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
> +#define DISJOINT_VMA_SCAN_RENEW_THRESH 16
>
> /*
> * The expensive part of numa migration is done from task_work context.
> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>  			/* Reset happens after 4 times scan delay of scan start */
>  			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>  		}
>
>  		/*
> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>  						vma->numab_state->next_scan))
>  			continue;
>
> +		/*
> +		 * For long running tasks, renew the disjoint vma scanning
> +		 * periodically.
> +		 */
> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))

Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?
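
A sketch of how the check might look with the accessor, modelled in userspace
so it can be compiled standalone. The READ_ONCE() here is a simplified
stand-in for the kernel macro (it relies on the GCC/Clang __typeof__
extension), and struct mm_model and renew_disjoint_scan() are hypothetical
names, since the body of the if statement is not quoted above.

```c
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's READ_ONCE(); an assumption. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define DISJOINT_VMA_SCAN_RENEW_THRESH 16

/* Hypothetical model holding only the field used by the check. */
struct mm_model {
	int numa_scan_seq;
};

static bool renew_disjoint_scan(struct mm_model *mm)
{
	/* Read once so both halves of the test see the same value. */
	int seq = READ_ONCE(mm->numa_scan_seq);

	return seq && !(seq % DISJOINT_VMA_SCAN_RENEW_THRESH);
}
```

Reading the sequence number into a local also avoids the two plain loads in
the quoted condition observing different values if another thread bumps
numa_scan_seq in between.
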

Regards,
Bharata.