Re: [RFC] autonuma: Support to scan page table asynchronously
From: Mel Gorman
Date: Tue Apr 14 2020 - 08:06:57 EST
On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> In current AutoNUMA implementation, the page tables of the processes
> are scanned periodically to trigger the NUMA hint page faults. The
> scanning runs in the context of the processes, so will delay the
> running of the processes. In a test with 64 threads pmbench memory
> accessing benchmark on a 2-socket server machine with 104 logical CPUs
> and 256 GB memory, there are more than 20000 latency outliers longer
> than 1 ms in 3600s run time. These latency outliers are almost all
> caused by the AutoNUMA page table scanning, because they almost all
> disappear after applying this patch to scan the page tables
> asynchronously.
>
> Because there are idle CPUs in the system, the asynchronous page
> table scanning code can run on these idle CPUs instead of the CPUs the
> workload is running on.
>
> So on a system with enough idle CPU time, it's better to scan the page
> tables asynchronously to take full advantage of this idle CPU time.
> Another scenario which can benefit from this is to scan the page
> tables on some service CPUs of the socket, so that the real workload
> can run on the isolated CPUs without the latency outliers caused by
> the page table scanning.
>
> But scanning page tables asynchronously is not perfect either. For
> example, on a system without enough idle CPU time, the CPU time isn't
> scheduled fairly because the page table scanning is charged to the
> workqueue thread instead of the process/thread it works for. And
> although the page tables are scanned for the target process, it may
> run on a CPU that is not in the cpuset of the target process.
>
> One possible solution is to let the system administrator choose the
> better behavior for the system via a sysctl knob (implemented in the
> patch). But that is not perfect either, because every user space knob
> adds maintenance overhead.
>
> A better solution may be to back-charge the CPU time to scan the page
> tables to the process/thread, and find a way to run the work on the
> proper cpuset. After some googling, I found there's some discussion
> about this as in the following thread,
>
> https://lkml.org/lkml/2019/6/13/1321
>
> So this patch may not be ready to be merged by upstream yet. It
> quantifies the latency outliers caused by the page table scanning in
> AutoNUMA. And it provides a possible way to resolve the issue for
> users who care about it. And it is a potential customer of the work
> related to the cgroup-aware workqueue or other asynchronous execution
> mechanisms.
>
The caveats you list are the important ones and the reason why it was
not done asynchronously. In an earlier implementation all the work was
done by a dedicated thread, and that approach was ultimately abandoned.

There is no guarantee that an idle CPU is available, let alone one that is
local to the thread that should be doing the scanning. Even if there is,
it potentially prevents another task from scheduling on an idle CPU and
similarly other workqueue tasks may be delayed waiting on the scanner.
Hiding the cost is also problematic because the CPU cost is mixed with
other unrelated workqueues. It also has the potential to mask bugs. Let's
say, for example, there is a bug whereby a task is scanning excessively;
that can be easily missed when the work is done by a workqueue.

While it's just an opinion, my preference would be to focus on reducing
the cost and amount of scanning done -- particularly for threads. For
example, all threads operate on the same address space but there can be
significant overlap where all threads are potentially scanning the same
areas or regions that the thread has no interest in. One option would be
to track the highest and lowest pages accessed and only scan within
that range. The tricky part is that library pages may
create very wide windows that render the tracking useless but it could
at least be investigated.
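
As a rough illustration of the window idea only, here is a userspace
sketch in C. It is not kernel code and not a proposed patch: struct
scan_window and the window_* helpers are hypothetical names invented for
this example, and it assumes page_size is a power of two and that
addresses are not at the very top of the address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window covering the lowest and highest pages accessed. */
struct scan_window {
	uintptr_t lo;	/* start of the lowest page accessed */
	uintptr_t hi;	/* end (exclusive) of the highest page accessed */
};

/* Start with an empty window: lo >= hi means nothing was accessed yet. */
static void window_init(struct scan_window *w)
{
	w->lo = UINTPTR_MAX;
	w->hi = 0;
}

/* Widen the window so that it covers the page containing addr. */
static void window_note_access(struct scan_window *w, uintptr_t addr,
			       uintptr_t page_size)
{
	uintptr_t start, end;

	/* Assumption: page_size is a non-zero power of two. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	start = addr & ~(page_size - 1);
	end = start + page_size;

	if (start < w->lo)
		w->lo = start;
	if (end > w->hi)
		w->hi = end;
}

/*
 * Clamp the scan range [*start, *end) to the window.
 * Returns false if there is nothing left to scan.
 */
static bool window_clamp(const struct scan_window *w, uintptr_t *start,
			 uintptr_t *end)
{
	if (w->lo >= w->hi)
		return false;
	if (*start < w->lo)
		*start = w->lo;
	if (*end > w->hi)
		*end = w->hi;
	return *start < *end;
}
```

The sketch also shows the weakness mentioned above: a single access to a
library page far from the heap widens [lo, hi) to span everything in
between, so the clamp stops filtering much of anything.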
--
Mel Gorman
SUSE Labs