Re: [PATCH V2 0/6] VA to numa node information

From: Steven Sistare
Date: Mon Nov 26 2018 - 14:21:31 EST


On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
> On 9/24/18 10:14 AM, Michal Hocko wrote:
>> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> [...]
>>>> Why does this matter for something that is for analysis purposes?
>>>> Reading the file for the whole address space is far from a free
>>>> operation. Is the page walk optimization really essential for usability?
>>>> Moreover what prevents move_pages implementation to be clever for the
>>>> page walk itself? In other words why would we want to add a new API
>>>> rather than make the existing one faster for everybody.
>>> One could optimize move pages. If the caller passes a consecutive range
>>> of small pages, and the page walk sees that a VA is mapped by a huge page,
>>> then it can return the same numa node for each of the following VA's that fall
>>> into the huge page range. It would be faster than 55 nsec per small page, but
>>> hard to say how much faster, and the cost is still driven by the number of
>>> small pages.
>> This is exactly what I was arguing for. There is some room for
>> improvements for the existing interface. I yet have to hear the explicit
>> usecase which would require even better performance that cannot be
>> achieved by the existing API.
>>
>
> Above mentioned optimization to move_pages() API helps when scanning
> mapped huge pages, but does not help if there are large sparse mappings
> with few pages mapped. Otherwise, consider adding page walk support in
> the move_pages() implementation, enhance the API(new flag?) to return
> address range to numa node information. The page walk optimization
> would certainly make a difference for usability.
>
> We can have applications(Like Oracle DB) having processes with large sparse
> mappings(in TBs) with only some areas of these mapped address range
> being accessed, basically large portions not having page tables backing it.
> This can become more prevalent on newer systems with multiple TBs of
> memory.
>
> Here is some data from pmap using move_pages() API with optimization.
> Following table compares time pmap takes to print address mapping of a
> large process, with numa node information using move_pages() api vs pmap
> using /proc numa_vamaps file.
>
> Running pmap command on a process with 1.3 TB of address space, with
> sparse mappings.
>
>                         ~1.3 TB sparse    250G dense segment with hugepages
> move_pages              8.33s             3.14s
> optimized move_pages    6.29s             0.92s
> /proc numa_vamaps       0.08s             0.04s
>
> Second column is pmap time on a 250G address range of this process, which maps
> hugepages(THP & hugetlb).

The data look compelling to me. numa_vamaps provides a much smoother user experience
for the analyst who is casting a wide net looking for the root of a performance issue.
Almost no waiting to see the data.
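
To illustrate why the per-page interface scales with the number of small
pages, here is a minimal sketch (not taken from the patch set) of the
post-processing a tool like pmap would need on top of move_pages(). When
move_pages() is called with nodes == NULL it fills a status array with one
entry per queried page: a node number, or a negative error value for a page
that is not present. The tool then has to collapse that array into address
ranges itself. The names coalesce_status() and struct node_range are
hypothetical, and the sketch works on an already filled array so it does not
depend on libnuma.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one run of consecutive pages with the same status. */
struct node_range {
    size_t first_page;  /* index of the first page in the run */
    size_t npages;      /* number of pages in the run */
    int node;           /* NUMA node, or a negative value for "not present" */
};

/*
 * Collapse a per-page status array into runs of equal values.
 * Returns the total number of runs found; at most max_out of them
 * are written to out. The loop still touches every page entry once,
 * so the cost is proportional to the number of small pages queried.
 */
size_t coalesce_status(const int *status, size_t count,
                       struct node_range *out, size_t max_out)
{
    size_t nruns = 0;
    size_t i = 0;

    assert(status != NULL || count == 0);
    assert(out != NULL || max_out == 0);

    while (i < count) {
        size_t start = i;
        int node = status[i];

        /* Advance to the end of the run that shares this status value. */
        while (i < count && status[i] == node)
            i++;

        if (nruns < max_out) {
            out[nruns].first_page = start;
            out[nruns].npages = i - start;
            out[nruns].node = node;
        }
        nruns++;
    }
    return nruns;
}
```

A caller would multiply first_page and npages by the page size and add the
base address of the queried range to print "start-end node" lines. A range
based interface returns that form directly, without building the per-page
array first.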

- Steve