[RFC 00/11] khugepaged: mTHP support
From: Nico Pache
Date: Wed Jan 08 2025 - 18:33:13 EST
The following series provides khugepaged and madvise collapse with the
capability to collapse regions to mTHPs.
To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. Instead, max_ptes_none is mapped to a 0-100 range to
determine how full an mTHP order needs to be before collapsing it.
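To illustrate the max_ptes_none scaling, here is a minimal userspace
sketch. It is not the code from the series: PTES_PER_PMD, the helper
names, and the exact rounding are assumptions made for the example, and
the real patches may compute the threshold differently.

```c
#include <stdbool.h>

/* Assumption: 512 PTEs per PMD (x86-64 with 4K pages). */
#define PTES_PER_PMD 512u

/*
 * Hypothetical helper: map max_ptes_none (0..PTES_PER_PMD-1) to the
 * percentage of a candidate region that must be populated.
 * max_ptes_none == 0   -> 100 (region must be completely full)
 * max_ptes_none == 511 -> 0   (any populated page is enough)
 */
static unsigned int required_fill_pct(unsigned int max_ptes_none)
{
	if (max_ptes_none >= PTES_PER_PMD)
		max_ptes_none = PTES_PER_PMD - 1;
	return (PTES_PER_PMD - 1 - max_ptes_none) * 100u / (PTES_PER_PMD - 1);
}

/*
 * Hypothetical helper: is a region with `present` of `total` pages
 * populated full enough to be collapsed? A completely empty region is
 * never a candidate.
 */
static bool region_meets_threshold(unsigned int present, unsigned int total,
				   unsigned int max_ptes_none)
{
	return present > 0 &&
	       present * 100u >= required_fill_pct(max_ptes_none) * total;
}
```

With this mapping the same tunable can be applied to any order, since
the check is a ratio rather than an absolute count of empty PTEs.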
Some design choices to note:
- Bitmap structures are allocated dynamically because on some arches
(like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
compile time, leading to warnings.
- The recursion is implemented iteratively using an explicit stack structure.
- MTHP_MIN_ORDER was added to compress the bitmap and ensure it is
64 bits on x86. This provides some optimization on the bitmap operations.
If other arches/configs that have more than 512 PTEs per PMD want to
compress their bitmap further, we can change this value per arch.
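
The bitmap walk and the stack-based recursion can be sketched in
userspace as follows. This is only an illustration under stated
assumptions, not the code from the patches: MIN_ORDER of 3 is inferred
from a 64-bit bitmap covering 512 PTEs, and struct range, count_bits(),
pick_orders() and the min_pct parameter are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_ORDER 3u	/* assumption: each bit covers 8 PTEs */
#define PMD_ORDER 9u	/* assumption: 512 PTEs per PMD */
/* Depth-first splitting grows the stack by at most one entry per level. */
#define STACK_MAX (PMD_ORDER - MIN_ORDER + 2u)

struct range {
	unsigned int order;	/* candidate mTHP order */
	unsigned int offset;	/* first bit of the range in the bitmap */
};

/* Portable population count. */
static unsigned int count_bits(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Walk the bitmap from the full PMD range downwards. A range that is at
 * least min_pct percent utilized is recorded; otherwise it is split in
 * half and both halves are retried at the next lower order.
 * Returns the number of ranges written to out.
 */
static size_t pick_orders(uint64_t bitmap, unsigned int min_pct,
			  struct range *out, size_t out_cap)
{
	struct range stack[STACK_MAX];
	size_t top = 0, n = 0;

	stack[top++] = (struct range){ PMD_ORDER, 0 };
	while (top) {
		struct range r = stack[--top];
		unsigned int nbits = 1u << (r.order - MIN_ORDER);
		/* Avoid the undefined 64-bit shift for the full range. */
		uint64_t mask = (nbits == 64) ? ~0ULL
					      : (((1ULL << nbits) - 1) << r.offset);
		unsigned int set = count_bits(bitmap & mask);

		if (set && set * 100u >= min_pct * nbits) {
			if (n < out_cap)
				out[n++] = r;
			continue;
		}
		if (r.order == MIN_ORDER)
			continue;
		/* Push the upper half first so the lower half is visited first. */
		stack[top++] = (struct range){ r.order - 1, r.offset + nbits / 2 };
		stack[top++] = (struct range){ r.order - 1, r.offset };
	}
	return n;
}
```

For example, a bitmap with only its lower 32 bits set and min_pct of
100 yields a single order-8 range at offset 0, while the same bitmap
with min_pct of 50 yields one order-9 range covering the whole PMD.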
Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3: A minor "fix"/optimization
Patch 4: Refactor/rename hpage_collapse
Patch 5-7: Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches
This series acts as an alternative to Dev Jain's approach [1]. The two
series differ in a few ways:
- My approach uses a bitmap to store the state of the linear scan_pmd to
then determine potential mTHP batches. Dev incorporates his directly
into the scan and tries each available order.
- Dev is attempting to optimize the locking, while my approach keeps the
locking changes to a minimum. I believe his changes are not safe for
uffd.
- Dev's changes only work for khugepaged, not madvise_collapse (although
I think that was by choice and it could easily support madvise).
- Dev scales all khugepaged sysfs tunables by order, while I'm removing
the restriction of max_ptes_none and converting it to a scale to
determine a (m)THP threshold.
- Dev turns on khugepaged if any order is available while mine still
only runs if PMDs are enabled. I like Dev's approach and will most
likely do the same in my PATCH posting.
- mTHPs need their ref count updated to 1<<order, which Dev is missing.
Patch 11 was inspired by one of Dev's changes.
[1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/
Nico Pache (11):
  introduce khugepaged_collapse_single_pmd to collapse a single pmd
  khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
  khugepaged: Don't allocate khugepaged mm_slot early
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: remove max_ptes_none restriction on the pmd scan
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h |   4 +-
 mm/huge_memory.c           |   3 +-
 mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
 3 files changed, 306 insertions(+), 137 deletions(-)
--
2.47.1