Hi Raghu,
On Tue, Jun 18, 2024 at 12:41:05AM +0530, Raghavendra K T wrote:
On 6/14/2024 10:26 AM, Chen Yu wrote:
From: Yujie Liu <yujie.liu@xxxxxxxxx>
Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot. Meanwhile, it could be
a double-sword that, the reducing of the vma scan might create less Numa
page fault information. The insufficient information makes it harder for
the Numa balancer to make decision. Later,
commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
mm numa_scan_seq based unconditional scan") are found to bring back part
of the performance.
Recently when running SPECcpu on a 320 CPUs/2 Sockets system, a long
duration of remote Numa node read was observed by PMU events. It causes
high core-to-core variance and performance penalty. After the
investigation, it is found that many vmas are skipped due to the active
PID check. According to the trace events, in most cases, vma_is_accessed()
returns false because both pids_active[0] and pids_active[1] have been
cleared.
Thank you for reporting this and also giving potential fix.
I do think this is a good fix to start with.
Thanks a lot for your valuable comments and suggestions.
As an experiment, if the vma_is_accessed() is hacked to always return true,
the long duration remote Numa access is gone.
Proposal:
The main idea is to adjust vma_is_accessed() to let it return true easier.
solution 1 is to extend the pids_active[] from 2 to N, which has already
been proposed by Peter[1]. And how to decide N needs investigation.
I am curious if this (PeterZ's suggestion) implementation in PATCH1 of
link:
https://lore.kernel.org/linux-mm/cover.1710829750.git.raghavendra.kt@xxxxxxx/
get some benefit. I did not see good usecase at that point. but worth a
try to see if it improves performance in your case.
PATCH1 extends the array size of pids_active[] from 2 to 4, so the
history info can be kept for a longer time, but it is possible that the
scanning could still be missed after the task sleeps for a long enough
time. It seems that the N could be task-specific rather than a fixed
value.
Anyway, we will test PATCH1 to see if it helps in our benchmark and
come back later.
solution 2 is to compare the diff between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
scan the vma.
solution 2 can be used to cover process-based workload(SPECcpu eg). The
reason is: There is only 1 thread within this process. If this process
access the vma at the beginning, then sleeps for a long time, the
pid_active array will be cleared. When this process is woken up, it will
never get a chance to set prot_none anymore. Because only the first 2
times of access is regarded as accessed:
(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
and no other threads can help set this prot_none.
To Summarize: (just thinking loud on the problem IIUC)
The issue overall is, we are not handling the scanning of a single
(fewer) thread task that sleeps or inactive) some time adequately.
one solution is to unconditionally return true (in a way inversely
proportional to number of threads in a task).
But,
1. Does it regress single (or fewer) threaded tasks which does
not really need aggressive scanning.
We haven't observed such regression so far as we don't have a suitable
workload that can well represent the scenario of "tasks that do not
need aggressive scanning."
In theory, it will bring extra scanning overhead, but the penalty of
missing the necessary scanning for the tasks that do need to be migrated
may be more serious since it can result in long time remote node memory
access. This is more likely a trade-off between the scanning cost and
coverage.
2. Are we able to address the issue for multi threaded tasks which
show similar kind of pattern (viz., inactive for some duration regularly).
Theoretically it should do. If multi-threads access different VMAs of
their own, like autonuma bench THREAD_LOCAL, each thread can only help
itself to do the pg_none tagging. We have observed slight performance
improvement with this patch applied when running autonuma bench
THREAD_LOCAL.
In common use cases, tasks with multiple threads are likely to share
some vmas, so there could be higher chance that other threads help tag
the pg_none for the current thread, thus we can tolerate more vma skip,
and vice versa.