Re: [RFC PATCH 00/39] 1G page support for guest_memfd

From: Michal Hocko
Date: Wed Sep 11 2024 - 02:57:01 EST


Cc Oscar for awareness

On Tue 10-09-24 23:43:31, Ackerley Tng wrote:
> Hello,
>
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
>
> The patchset covers:
>
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and shared
>   memory can use the same physical pages
> + Splitting and reconstructing pages to support conversions and mmap()
> + How the VM, userspace and guest_memfd interact to support conversions
> + Selftests to test all the above
>     + Selftests also demonstrate the conversion flow between VM, userspace and
>       guest_memfd.
>
> Why 1G pages in guest_memfd?
>
> Bring guest_memfd to performance and memory savings parity with VMs that are
> backed by HugeTLBfs.
>
> + Performance is improved with 1G pages by more TLB hits and faster page walks
>   on TLB misses.
> + Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
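
To put a rough number on the HVO savings mentioned above, here is a minimal
arithmetic sketch. It assumes 4K base pages, a 64-byte struct page, and that
one vmemmap page is retained per huge page; all three values are assumptions
for illustration and depend on architecture and kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL /* assumption: 4K base pages */
#define STRUCT_PAGE_SIZE 64ULL   /* assumption: sizeof(struct page) == 64 */
#define VMEMMAP_KEPT     1ULL    /* assumption: vmemmap pages kept per huge page */

/* Number of base pages of struct page metadata describing one huge page. */
static uint64_t vmemmap_pages(uint64_t huge_page_size)
{
	uint64_t nr_struct_pages = huge_page_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* vmemmap pages that could be freed per huge page under the assumptions above. */
static uint64_t vmemmap_pages_saved(uint64_t huge_page_size)
{
	uint64_t total = vmemmap_pages(huge_page_size);

	assert(total >= VMEMMAP_KEPT);
	return total - VMEMMAP_KEPT;
}
```

Under these assumptions a 1G page is described by 4096 vmemmap pages (16M), of
which 4095 could be freed, versus 7 of 8 for a 2M page.
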
>
> Options for 1G page support:
>
> 1. HugeTLB
> 2. Contiguous Memory Allocator (CMA)
> 3. Other suggestions are welcome!
>
> Comparison between options:
>
> 1. HugeTLB
>     + Refactor HugeTLB to separate allocator from the rest of HugeTLB
>     + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
>         + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
>     + Pro: Can provide iterative steps toward new future allocator
>         + Unexplored: Managing userspace-visible changes
>             + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
>               but not when future allocator is used
> 2. CMA
>     + Port some HugeTLB features to be applied on CMA
>     + Pro: Clean slate
>
> What would refactoring HugeTLB involve?
>
> (Some refactoring was done in this RFC, more can be done.)
>
> 1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
>     + Brings more modularity to HugeTLB
>     + No functionality change intended
>     + Likely step towards HugeTLB's integration into core-mm
> 2. guest_memfd will use just the allocator component of HugeTLB, not including
>    the complex parts of HugeTLB like
>     + Userspace reservations (resv_map)
>     + Shared PMD mappings
>     + Special page walkers
>
> What features would need to be ported to CMA?
>
> + Improved allocation guarantees
>     + Per NUMA node pool of huge pages
>     + Subpools per guest_memfd
> + Memory savings
>     + Something like HugeTLB Vmemmap Optimization
> + Configuration/reporting features
>     + Configuration of number of pages available (and per NUMA node) at and
>       after host boot
>     + Reporting of memory usage/availability statistics at runtime
>
> HugeTLB was picked as the source of 1G pages for this RFC because it allows a
> graceful transition, and retains memory savings from HVO.
>
> To illustrate this, suppose CMA were the source instead. If a host machine
> uses HugeTLBfs to back VMs, and a confidential VM were to be scheduled on that
> host, some HugeTLBfs pages would have to be given up and returned to CMA for
> guest_memfd pages to be rebuilt from that memory. This requires memory to be
> reserved for HVO to be removed and reapplied on the new guest_memfd memory.
> This not only slows down memory allocation but also trims the benefits of HVO.
> Memory would have to be reserved on the host to facilitate these transitions.
>
> Improving how guest_memfd uses the allocator in a future revision of this RFC:
>
> To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
> should be limited to these allocator functions:
>
> + reserve(node, page_size, num_pages) => opaque handle
>     + Used when a guest_memfd inode is created to reserve memory from backend
>       allocator
> + allocate(handle, mempolicy, page_size) => folio
>     + To allocate a folio from guest_memfd's reservation
> + split(handle, folio, target_page_size) => void
>     + To take a huge folio, and split it to smaller folios, restore to filemap
> + reconstruct(handle, first_folio, nr_pages) => void
>     + To take a folio, and reconstruct a huge folio out of nr_pages from the
>       first_folio
> + free(handle, folio) => void
>     + To return folio to guest_memfd's reservation
> + error(handle, folio) => void
>     + To handle memory errors
> + unreserve(handle) => void
>     + To return guest_memfd's reservation to allocator backend
>
> Userspace should only provide a page size when creating a guest_memfd and should
> not have to specify HugeTLB.
>
> Overview of patches:
>
> + Patches 01-12
>     + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
>       HugeTLB, and to expose HugeTLB functions.
> + Patches 13-16
>     + Letting guest_memfd use HugeTLB
>     + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
>       and puts them into the guest_memfd inode's subpool
>     + Each folio allocation takes a page from the guest_memfd inode's subpool
> + Patches 17-21
>     + Selftests for new HugeTLB features in guest_memfd
> + Patches 22-24
>     + More small changes on the HugeTLB side to expose functions needed by
>       guest_memfd
> + Patch 25
>     + Uses the newly available functions from patches 22-24 to split HugeTLB
>       pages. In this patch, HugeTLB folios are always split to 4K before any
>       usage, private or shared.
> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages
> + Patch 30
>     + Required to zero folios after conversions to avoid leaking initialized
>       kernel memory
> + Patches 31-38
>     + Add selftests to test mapping pages to userspace, guest/host memory
>       sharing and update conversions tests
>     + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
> + Patch 39
>     + Dynamically split and reconstruct HugeTLB pages instead of always
>       splitting before use. All earlier selftests are expected to still pass.
>
> TODOs:
>
> + Add logic to wait for safe_refcount [1]
> + Look into lazy splitting/reconstruction of pages
>     + Currently, when KVM_SET_MEMORY_ATTRIBUTES is invoked, not only are the
>       mem_attr_array and faultability updated, but the pages in the requested
>       range are also split/reconstructed as necessary. We want to look into
>       delaying splitting/reconstruction to fault time.
> + Solve race between folios being faulted in and being truncated
>     + When running private_mem_conversions_test with more than 1 vCPU, a folio
>       getting truncated may get faulted in by another process, causing elevated
>       mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
> + Add intermediate splits (1G should first split to 2M and not split directly to
>   4K)
> + Use guest's lock instead of hugetlb_lock
> + Use multi-index xarray/replace xarray with some other data struct for
>   faultability flag
> + Refactor HugeTLB better, present generic allocator interface
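
For the intermediate-split TODO above, the arithmetic can be sketched as
follows. The helper names are hypothetical, and the second function assumes
that only the 2M region containing the page being converted is split further
to 4K; the cover letter does not state that policy.

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K ((size_t)4 << 10)
#define SZ_2M ((size_t)2 << 20)
#define SZ_1G ((size_t)1 << 30)

/*
 * Number of folios of size `to` produced by splitting one folio of size
 * `from`. Returns 0 if the split is not possible (hypothetical helper).
 */
static size_t split_fanout(size_t from, size_t to)
{
	if (to == 0 || to > from || from % to != 0)
		return 0;
	return from / to;
}

/*
 * Folios left after converting a single 4K page out of a 1G folio, assuming
 * the 1G folio is split to 2M first and only one 2M folio is split to 4K.
 */
static size_t folios_after_partial_split(void)
{
	size_t nr_2m = split_fanout(SZ_1G, SZ_2M);

	return (nr_2m - 1) + split_fanout(SZ_2M, SZ_4K);
}
```

Splitting 1G directly to 4K yields 262144 folios, while the two-level scheme
under the stated assumption leaves 511 folios of 2M plus 512 folios of 4K.
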
>
> Please let us know your thoughts on:
>
> + HugeTLB as the choice of transitional allocator backend
> + Refactoring HugeTLB to provide generic allocator interface
> + Shared/private conversion flow
>     + Requiring user to request kernel to unmap pages from userspace using
>       madvise(MADV_DONTNEED)
>     + Failing conversion on elevated mapcounts/pincounts/refcounts
> + Process of splitting/reconstructing page
> + Anything else!
>
> [1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@xxxxxxxxxxx/T/
>
> Ackerley Tng (37):
>   mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
>   mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
>   mm: hugetlb: Remove unnecessary check for avoid_reserve
>   mm: mempolicy: Refactor out policy_node_nodemask()
>   mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
>     interpret mempolicy instead of vma
>   mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
>   mm: hugetlb: Refactor out hugetlb_alloc_folio
>   mm: truncate: Expose preparation steps for truncate_inode_pages_final
>   mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
>   mm: hugetlb: Add option to create new subpool without using surplus
>   mm: hugetlb: Expose hugetlb_acct_memory()
>   mm: hugetlb: Move and expose hugetlb_zero_partial_page()
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>     anonymous inodes
>   KVM: guest_memfd: hugetlb: initialization and cleanup
>   KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
>   KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
>   KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
>   KVM: selftests: Support various types of backing sources for private
>     memory
>   KVM: selftests: Update test for various private memory backing source
>     types
>   KVM: selftests: Add private_mem_conversions_test.sh
>   KVM: selftests: Test that guest_memfd usage is reported via hugetlb
>   mm: hugetlb: Expose vmemmap optimization functions
>   mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
>   mm: hugetlb: Add functions to add/move/remove from hugetlb lists
>   KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
>   KVM: guest_memfd: Allow mmapping guest_memfd files
>   KVM: guest_memfd: Use vm_type to determine default faultability
>   KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
>   KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
>   KVM: selftests: Allow vm_set_memory_attributes to be used without
>     asserting return value of 0
>   KVM: selftests: Test using guest_memfd memory from userspace
>   KVM: selftests: Test guest_memfd memory sharing between guest and host
>   KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
>     guest_memfd
>   KVM: selftests: Test that pinned pages block KVM from setting memory
>     attributes to PRIVATE
>   KVM: selftests: Refactor vm_mem_add to be more flexible
>   KVM: selftests: Add helper to perform madvise by memslots
>   KVM: selftests: Update private_mem_conversions_test for mmap()able
>     guest_memfd
>
> Vishal Annapurve (2):
>   KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
>   KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
>
> fs/hugetlbfs/inode.c | 35 +-
> include/linux/hugetlb.h | 54 +-
> include/linux/kvm_host.h | 1 +
> include/linux/mempolicy.h | 2 +
> include/linux/mm.h | 1 +
> include/uapi/linux/kvm.h | 26 +
> include/uapi/linux/magic.h | 1 +
> mm/hugetlb.c | 346 ++--
> mm/hugetlb_vmemmap.h | 11 -
> mm/mempolicy.c | 36 +-
> mm/truncate.c | 26 +-
> tools/include/linux/kernel.h | 4 +-
> tools/testing/selftests/kvm/Makefile | 3 +
> .../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
> .../selftests/kvm/guest_memfd_pin_test.c | 104 ++
> .../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
> .../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
> .../testing/selftests/kvm/include/kvm_util.h | 45 +-
> .../testing/selftests/kvm/include/test_util.h | 18 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
> tools/testing/selftests/kvm/lib/test_util.c | 99 ++
> .../kvm/x86_64/private_mem_conversions_test.c | 158 +-
> .../x86_64/private_mem_conversions_test.sh | 91 +
> .../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
> virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
> virt/kvm/kvm_main.c | 17 +
> virt/kvm/kvm_mm.h | 16 +
> 27 files changed, 3288 insertions(+), 443 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
> create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
> create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
> create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
>
> --
> 2.46.0.598.g6f2099f65c-goog

--
Michal Hocko
SUSE Labs