Re: [PATCH V2 0/6] VA to numa node information

From: Prakash Sangappa
Date: Fri Sep 14 2018 - 14:06:13 EST




On 9/14/18 9:01 AM, Steven Sistare wrote:
On 9/14/2018 1:56 AM, Michal Hocko wrote:
On Thu 13-09-18 15:32:25, prakash.sangappa wrote:

The proc interface provides an efficient way to export address range
to numa node id mapping information compared to using the API.
Do you have any numbers?

For example, for sparsely populated mappings, if a VMA has large portions
that do not have any physical pages mapped, the page walk done through the
/proc file interface can skip over non-existent PMDs / PTEs. Whereas using
the API, the application would have to scan the entire VMA in page-size units.
What prevents you from pre-filtering by reading /proc/$pid/maps to get
ranges of interest?
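
A minimal sketch of the pre-filtering suggested above, assuming the usual
"start-end perms offset dev inode [pathname]" layout of /proc/$pid/maps lines.
The helper names and the sample text are hypothetical and only illustrate how
many small-page addresses would still have to be passed to move_pages() after
holes between VMAs are dropped.

```python
import re

# One line of /proc/<pid>/maps: "start-end perms offset dev inode [pathname]"
MAPS_LINE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s")


def parse_maps(text):
    """Return a list of (start, end, perms) tuples, one per mapped range."""
    ranges = []
    for line in text.splitlines():
        m = MAPS_LINE.match(line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return ranges


def pages_to_query(ranges, page_size=4096):
    """Count the small-page addresses covered by the given ranges.

    page_size=4096 is an assumption; the thread only says "small page".
    """
    return sum((end - start) // page_size for start, end, _ in ranges)


# Hypothetical sample: a small text mapping and one 400 GiB anonymous mapping.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
7f0000000000-7f6400000000 rw-p 00000000 00:00 0
"""

sample_ranges = parse_maps(SAMPLE)
sample_pages = pages_to_query(sample_ranges)
```

On a live system the text would come from open("/proc/self/maps").read().
Note that this only removes the gaps between VMAs; every page inside a valid
VMA still has to be queried, whether or not it is populated.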
That works for skipping holes, but not for skipping huge pages. I did a
quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.
Allocate 128 GB and touch every small page. Call move_pages with nodes=NULL
to get the node id for all pages, passing 512 consecutive small pages per
call to move_pages. The total move_pages time is 1.85 secs, or 55 nsec
per page. Extrapolating to a 1 TB range, it would take 15 sec to retrieve
the numa node for every small page in the range. That is not terrible, but
it is not interactive, and it becomes terrible for multiple TB.
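
The arithmetic behind those figures can be checked with a short sketch. It
assumes a 4 KiB small page and binary units (GiB, TiB), neither of which is
stated in the message; the function names are illustrative only.

```python
GIB = 1 << 30
TIB = 1 << 40
PAGE_SIZE = 4096  # assumed small-page size; not stated in the message


def per_page_ns(total_seconds, region_bytes, page_size=PAGE_SIZE):
    """Average cost per page in nanoseconds for a scan of region_bytes."""
    return total_seconds * 1e9 / (region_bytes // page_size)


def extrapolate_seconds(ns_per_page, region_bytes, page_size=PAGE_SIZE):
    """Projected total time in seconds, assuming cost scales linearly with pages."""
    return ns_per_page * (region_bytes // page_size) / 1e9


pages_128g = 128 * GIB // PAGE_SIZE          # 33,554,432 small pages
calls_128g = pages_128g // 512               # 65,536 move_pages calls
cost_ns = per_page_ns(1.85, 128 * GIB)       # about 55 ns per page
one_tib_s = extrapolate_seconds(cost_ns, TIB)  # about 14.8 s, i.e. roughly 15 s
```

The projection is linear in the number of small pages, so each additional TB
adds roughly another 15 seconds under the same assumptions.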


Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated with physical pages,
the page walk can skip over non-existent page table entries (PMDs) and so can be faster.

For example, reading the VA range of a 400GB VMA which has a few pages mapped
at the beginning and a few pages at the end, with no pages in the rest of the VMA,
takes 0.001s using the /proc interface. Whereas with the move_pages() API, passing 1024
consecutive small page addresses per call, it takes about 2.4 secs. This is on a similar
system running a 4.19 kernel.
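
A rough sketch of why the two approaches differ so much for this case. It
assumes 4 KiB small pages, x86-64 style page tables where one PMD entry covers
2 MiB, and that "400GB" means 400 GiB; none of these is stated above, and the
function names are illustrative. It only counts units of work and does not
model the kernel's actual page walk.

```python
GIB = 1 << 30
PAGE_SIZE = 4096        # assumed small-page size
PMD_SPAN = 2 << 20      # assumed bytes covered by one PMD entry (x86-64, 4 KiB pages)


def move_pages_addresses(vma_bytes, page_size=PAGE_SIZE):
    """Addresses user space must pass to move_pages() to cover the whole VMA."""
    return vma_bytes // page_size


def move_pages_calls(vma_bytes, per_call=1024, page_size=PAGE_SIZE):
    """Number of move_pages() calls when passing per_call addresses each time."""
    n = move_pages_addresses(vma_bytes, page_size)
    return (n + per_call - 1) // per_call


def pmd_entries(vma_bytes, pmd_span=PMD_SPAN):
    """Upper bound on PMD entries a page walk inspects if it never skips higher levels."""
    return vma_bytes // pmd_span


vma = 400 * GIB
addresses = move_pages_addresses(vma)   # 104,857,600 addresses
calls = move_pages_calls(vma)           # 102,400 calls
pmds = pmd_entries(vma)                 # 204,800 PMD entries at most
ns_per_address = 2.4e9 / addresses      # about 23 ns per address for the 2.4 s run
```

Under these assumptions the user-space scan touches 512 times as many units
as a walk that can skip an empty PMD entry in one step, and a walk that also
skips empty upper-level entries does less work still.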