[+linux-mm & others]
On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
make a least-effort attempt at a synchronous collapse of memory at
their own expense.
The only difference from MADV_COLLAPSE is that the new hugepage allocation
avoids direct reclaim and/or compaction, quickly failing on allocation errors.
The benefits of this approach are:
* CPU is charged to the process that wants to spend the cycles for the THP
* Avoids the unpredictable timing of khugepaged collapse
* Prevents unpredictable stalls caused by direct reclaim and/or compaction
Semantics
This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.
The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.
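
The clamping rule above can be sketched in a few lines of C. This is only an illustration of the arithmetic as described, not code taken from the patch: the helper name clamp_to_hugepages() is hypothetical, and the hugepage size is passed in by the caller (2 MiB is a common PMD size on x86-64, but that is an assumption here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: shrink [start, start + len) inward to hugepage
 * boundaries. hpage must be a power of two (e.g. 2 MiB on many x86-64
 * systems; that value is an assumption, not something the text fixes).
 * Returns false when no complete hugepage-sized region fits in the range.
 */
bool clamp_to_hugepages(uintptr_t start, size_t len, size_t hpage,
                        uintptr_t *hstart, uintptr_t *hend)
{
    assert(hpage != 0 && (hpage & (hpage - 1)) == 0);

    uintptr_t mask = (uintptr_t)hpage - 1;
    uintptr_t end = start + len;
    uintptr_t up = (start + mask) & ~mask;  /* first aligned address >= start */
    uintptr_t down = end & ~mask;           /* last aligned address <= end */

    if (end < start || up < start)          /* address arithmetic wrapped */
        return false;
    if (up >= down)                         /* less than one full hugepage */
        return false;

    *hstart = up;
    *hend = down;
    return true;
}
```

With a 2 MiB hugepage, a range starting at 0x201000 with length 0x600000 would be clamped to [0x400000, 0x800000), while the same start with length 0x200000 would be rejected because it covers no complete hugepage-sized region.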
All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).
Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This call operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.
Return Value
If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful. On success, madvise(2) returns 0.
Else, -1 is returned and errno is set to indicate the error for the
most-recently attempted hugepage collapse. Note that many failures might
have occurred, since the operation may continue to collapse in the event a
single hugepage-sized/aligned region fails.
ENOMEM  Memory allocation failed or VMA not found
EBUSY   Memcg charging failed
EAGAIN  Required resource temporarily unavailable. Trying again
        might succeed.
EINVAL  Other error: No PMD found, subpage doesn't have Present
        bit set, "Special" page not backed by struct page, VMA
        incorrectly sized, address not page-aligned, ...
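
Because errno only reflects the most recently attempted region, a caller that wants a result per region could issue one madvise(2) call per hugepage-sized region and then act on each errno. The sketch below shows one way to do that. The function names, the enum, and the retry/skip policy are all hypothetical and not part of the patch. The advice value is a parameter: the caller would pass MADV_COLLAPSE today, or the proposed MADV_TRY_COLLAPSE should it land (its numeric value is deliberately not defined here).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical caller-side policy derived from the errno list above. */
enum collapse_verdict {
    COLLAPSE_OK,      /* region collapsed or already PMD-mapped */
    COLLAPSE_RETRY,   /* EAGAIN: trying again might succeed */
    COLLAPSE_SKIP,    /* ENOMEM/EBUSY: allocation or memcg charge failed */
    COLLAPSE_INVALID  /* EINVAL and anything else: do not retry as-is */
};

enum collapse_verdict classify_collapse_errno(int err)
{
    switch (err) {
    case 0:      return COLLAPSE_OK;
    case EAGAIN: return COLLAPSE_RETRY;
    case ENOMEM: /* fall through */
    case EBUSY:  return COLLAPSE_SKIP;
    default:     return COLLAPSE_INVALID;
    }
}

/*
 * Call madvise() once per hpage-sized region of [start, start + len),
 * storing 0 or the errno of each call in errs[]. At most max regions
 * are attempted. Returns the number of regions whose call failed.
 */
size_t advise_each_region(void *start, size_t len, size_t hpage,
                          int advice, int *errs, size_t max)
{
    assert(errs != NULL);
    if (hpage == 0)
        return 0;

    size_t regions = len / hpage;
    size_t failed = 0;

    if (regions > max)
        regions = max;
    for (size_t i = 0; i < regions; i++) {
        char *p = (char *)start + i * hpage;

        errs[i] = madvise(p, hpage, advice) == 0 ? 0 : errno;
        if (errs[i] != 0)
            failed++;
    }
    return failed;
}
```

The trade-off is one system call per region instead of one per range, which may or may not be acceptable depending on how many regions the caller manages.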
Use Cases
An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
[1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
[2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
[3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
[4] https://github.com/golang/go/issues/63334
Thanks for the patch, Lance, and thanks for providing the links above,
referring to issues Go has seen.
I've reached out to the Go team to try and understand their use case,
and how we could help. It's not immediately clear whether a
lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
be.
That said, with respect to the implementation, should a need for a
lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
process_madvise(2) be the "v2" of madvise(2), where we can start
leveraging the forward-facing flags argument for these different
advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
("mm/madvise: remove racy mm ownership check") so that
process_madvise(2) can always operate on self. IIRC, this was ~ the
plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
sane default, and implement options in flags down the line).