[LSF/MM TOPIC] Generating physically contiguous memory

From: Zi Yan
Date: Fri Feb 15 2019 - 17:20:42 EST


The Problem
----

Large pages and physically contiguous memory are important to devices such as GPUs, FPGAs, NICs and RDMA controllers, because they can often reduce address translation overheads and hence achieve better performance when operating on large pages (2MB and beyond). The same can be said of CPU performance, of course, but there is an important difference: GPUs and high-throughput devices often take a more severe performance hit from a TLB miss than a CPU does, because a larger volume of in-flight work is stalled by the miss and the page table walks it induces. The effect is sufficiently large that such devices *really* want highly reliable ways to allocate large pages, to minimize TLB misses and reduce the duration of page table walks.



Due to their lack of flexibility, approaches that reserve memory at boot time (such as hugetlbfs) are a compromise that would be nice to avoid. THP, in general, seems to be the proper way to go, because it is transparent to userspace and provides large pages, but it is not perfect yet. The community is still working on it, since 1) THP size is limited by the page allocation system and 2) THP creation requires a lot of effort (e.g., memory compaction and page reclamation on the critical path of page allocations).




Possible solutions
----

1. I recently posted an RFC [1] about actively generating physically contiguous memory from in-use pages after page allocation. This RFC moves pages around and makes them physically contiguous when possible. It is different from existing approaches in that it does not rely on page allocation. On the other hand, this approach is still hampered by non-movable pages scattered across memory, a highly related but orthogonal problem, one possible solution to which was recently proposed by Mel Gorman [2].




2. THPs could be a solution, as they provide large pages. THP avoids memory reservation at boot time, but to meet the needs of some of these high-throughput accelerators, i.e., a lot of large pages, we need to make it easier to produce large pages, namely by increasing the success rate of THP allocations and decreasing their overheads. Mel Gorman has posted a related patchset [3].


It is also possible to generate THPs in the background, either as khugepaged does right now, by periodically performing memory compaction to lower the overall memory fragmentation level, or by maintaining pools of THPs for future use. But these solutions still face the same problem.
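
To make the discussion concrete, here is a minimal userspace sketch (not part of any of the patchsets above; the buffer size and 2MB alignment are illustrative assumptions) of how a consumer asks for THP-backed memory today, which is the allocation path whose success rate and overheads are in question:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL << 20)     /* assuming 2MB THPs (x86_64) */

int main(void)
{
        size_t len = 64 * HPAGE_SIZE;   /* 128MB, arbitrary */

        /* Over-allocate so we can hand madvise() a 2MB-aligned range. */
        char *buf = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        char *aligned = (char *)(((unsigned long)buf + HPAGE_SIZE - 1)
                                 & ~(HPAGE_SIZE - 1));

        /* Ask the kernel to back this range with THPs if it can. */
        if (madvise(aligned, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* Faulting the range in may be served with 2MB pages right away,
         * or khugepaged may collapse 4KB pages into THPs later on. */
        memset(aligned, 0, len);

        printf("buffer at %p, check AnonHugePages in /proc/self/smaps\n",
               (void *)aligned);
        return 0;
}

Whether the faults in memset() are actually served with 2MB pages, or the range is only collapsed later by khugepaged, depends on how easily the allocator can produce free 2MB blocks, which is exactly the problem above.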




3. A more restricted but more reliable way might be using libhugetlbfs. It reserves memory that is dedicated to large page allocations, so obtaining large pages from it requires much less effort. It also supports page sizes larger than 2MB, which further reduces address translation overheads. But AFAIK device drivers are not able to grab large pages directly from libhugetlbfs, which is something devices want.
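
For comparison, the hugetlb path looks roughly like the sketch below (again just an illustration, not taken from any patchset, and not necessarily via the libhugetlbfs library): the pool has to be filled up front, e.g. with "echo 512 > /proc/sys/vm/nr_hugepages" or hugepages=512 on the kernel command line, and userspace then maps from it with MAP_HUGETLB:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 512UL << 20;       /* 512MB, illustrative */

        /* MAP_HUGETLB uses the default huge page size (2MB on x86_64);
         * MAP_HUGE_1GB etc. from <linux/mman.h> can select larger sizes
         * if such pages have been reserved. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");    /* e.g. the pool is empty */
                return 1;
        }

        memset(buf, 0, len);    /* guaranteed to be backed by huge pages */
        munmap(buf, len);
        return 0;
}

This is reliable once the reservation succeeds, but the reservation itself is static, and, as said above, it does not give device drivers a direct way to grab those pages.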




4. Recently Matthew Wilcox mentioned that his XArray is going to support arbitrary-sized pages [4], which would help maintain physically contiguous ranges once they are created (e.g., by my RFC). Once my RFC generates physically contiguous memory, XArrays would maintain the page size and prevent reclaim/compaction from breaking the ranges apart. Getting arbitrary-sized pages can still be beneficial to devices when pages larger than 2MB become very difficult to get.
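
To illustrate what that buys us: the XArray already has multi-index ("multi-order") entries, where a single object covers a whole aligned range of indices, and that is presumably the facility arbitrary-sized page support would build on. A rough kernel-side sketch (function name and the order value are mine, not from [4]):

#include <linux/xarray.h>

static DEFINE_XARRAY(demo_array);

/* Store one object over 512 consecutive indices, i.e. one 2MB range of
 * 4KB pages, as a single entry (needs CONFIG_XARRAY_MULTI). */
static int store_2mb_entry(void *obj, unsigned long index)
{
        XA_STATE_ORDER(xas, &demo_array, index, 9);

        do {
                xas_lock(&xas);
                xas_store(&xas, obj);
                xas_unlock(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));

        return xas_error(&xas);
}

A lookup at any of the 512 indices returns the same entry, so anyone walking the structure sees the page size, which is what would keep reclaim/compaction from splitting the range behind the mapping's back.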



Feel free to provide your comments.

Thanks.


[1] https://lore.kernel.org/lkml/20190215220856.29749-1-zi.yan@xxxxxxxx/

[2] https://lore.kernel.org/lkml/20181123114528.28802-1-mgorman@xxxxxxxxxxxxxxxxxxx/

[3] https://lore.kernel.org/lkml/20190118175136.31341-1-mgorman@xxxxxxxxxxxxxxxxxxx/

[4] https://lore.kernel.org/lkml/20190208042448.GB21860@xxxxxxxxxxxxxxxxxxxxxx/



--
Best Regards,
Yan Zi