RE: [RFC] mm/hugetlb: use mem policy when allocating surplus huge pages

From: Andrejczuk, Grzegorz
Date: Fri Feb 10 2017 - 10:49:46 EST

On February 9, 2017 8:32 PM, Mike Kravetz wrote:
> I believe another way of stating the problem is as follows:
> At mmap(MAP_HUGETLB) time a reservation for the number of huge pages
> is made. If surplus huge pages need to be (and can be) allocated to
> satisfy the reservation, they will be allocated at this time. However,
> the memory policy of the task is not taken into account when these
> pages are allocated to satisfy the reservation.
> Later when the task actually faults on pages in the mapping, reserved
> huge pages should be instantiated in the mapping. However, at fault time
> the task's memory policy is taken into account. It is possible that the
> pages reserved at mmap() time, are located on nodes such that they can
> not satisfy the request with the task's memory policy. In such a case,
> the allocation fails in the same way as if there was no reservation.
> Does that sound accurate?

Yes, thank you for taking the time to rephrase it.
It's much clearer now.
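
To check that we mean the same thing, here is a toy user-space model of the
mismatch. It is not kernel code, and all names in it (toy_pool, toy_reserve,
toy_fault_bound, MAX_NODES) are made up for illustration. The reservation is
checked against one global counter, while the fault is bound to a single node:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical node count, only for this model */

/* Toy pool: per-node free counts plus ONE global reservation counter. */
struct toy_pool {
	long free_on_node[MAX_NODES];
	long resv_total;	/* carries no node information */
};

/*
 * mmap()-time check in the model: succeeds when the free pages summed over
 * all nodes cover the existing reservations plus the new request.
 * The task's memory policy is not consulted here.
 */
static bool toy_reserve(struct toy_pool *p, long pages)
{
	long free_total = 0;

	assert(pages >= 0);
	for (size_t n = 0; n < MAX_NODES; n++)
		free_total += p->free_on_node[n];

	if (free_total - p->resv_total < pages)
		return false;

	p->resv_total += pages;
	return true;
}

/*
 * Fault-time allocation restricted to one node (an MPOL_BIND-like policy).
 * Fails when that node has no free page, even if a reservation is held.
 */
static bool toy_fault_bound(struct toy_pool *p, int node)
{
	if (node < 0 || node >= MAX_NODES || p->free_on_node[node] == 0)
		return false;

	p->free_on_node[node]--;
	if (p->resv_total > 0)
		p->resv_total--;
	return true;
}
```

With one free page on node 0 and one on node 1, toy_reserve(&p, 2) succeeds,
but the second toy_fault_bound(&p, 0) fails although one reservation is still
outstanding.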

> Your problem statement (and solution) address the case where surplus huge
> pages need to be allocated at mmap() time to satisfy a reservation and
> later fault. I 'think' there is a more general problem with huge page
> reservations and memory policy.

Yes, I fixed a very specific code path. It is probably one of many
problems at the intersection of memory policy and huge page reservations.

> - In both cases, there are enough free pages to satisfy the reservation
> at mmap time. However, at fault time it can not get both the pages it
> requires from the specified node.

There is a difference: interleaving of preallocated huge pages is well known
and expected, whereas with overcommit the pages might or might not be allocated
on the requested NUMA node. Also, after setting nr_hugepages it is possible
to check the number of huge pages reserved for each node with:
cat /sys/devices/system/node/nodeX/hugepages/hugepages-2048kB/nr_hugepages
With nr_overcommit_hugepages this is impossible.
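
The same check can be done from a program. Below is a minimal sketch; the
helper name read_node_hugepage_counter and its parameters are only for
illustration, and the path layout is the one from the cat command above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read one per-node hugetlb counter, e.g.
 *   <sysfs_root>/node0/hugepages/hugepages-2048kB/nr_hugepages
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long read_node_hugepage_counter(const char *sysfs_root, int node,
				const char *size_dir, const char *counter)
{
	char path[256];
	long value = -1;
	FILE *f;
	int len;

	assert(sysfs_root && size_dir && counter);

	len = snprintf(path, sizeof(path), "%s/node%d/hugepages/%s/%s",
		       sysfs_root, node, size_dir, counter);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;	/* path did not fit into the buffer */

	f = fopen(path, "r");
	if (!f)
		return -1;	/* no such node, page size or counter */

	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

For example, read_node_hugepage_counter("/sys/devices/system/node", 0,
"hugepages-2048kB", "nr_hugepages") would return the node 0 value on a system
that has this file, and -1 otherwise.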

> I'm thinking we may need to expand the reservation tracking to be
> per-node like free_huge_pages_node and others. Like the code below,
> we need to take memory policy into account at reservation time.
> Thoughts?

Are the numbers of free, allocated and surplus huge pages tracked in the sysfs
directory mentioned above?
My limited understanding of this problem is that evaluating any of the memory
policies requires the struct vm_area_struct (for bind, preferred) and the
address (for interleave).
The first is lost in hugetlb_reserve_pages(), the latter is lost when file->mmap is called.
So the reservation of huge pages needs to be done in mmap_region(),
before calling file->mmap, and I think this requires some new hugetlb API.

Best Regards,