Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

From: David Hildenbrand
Date: Tue Dec 05 2023 - 06:16:30 EST

Next message: Jiri Olsa: "Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals."
Previous message: Ahmad Fatoum: "Re: [PATCH v7 2/2] arm64: boot: Support Flat Image Tree"
In reply to: Ryan Roberts: "Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP"
Next in thread: Barry Song: "Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 05.12.23 11:48, Ryan Roberts wrote:

On 05/12/2023 01:24, Barry Song wrote:

On Tue, Dec 5, 2023 at 9:15 AM Barry Song <21cnbao@xxxxxxxxx> wrote:

On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

Introduce the logic to allow THP to be configured (through the new sysfs
interface we just added) to allocate large folios to back anonymous
memory, which are larger than the base page size but smaller than
PMD-size. We call this new THP extension "multi-size THP" (mTHP).

mTHP continues to be PTE-mapped, but in many cases can still provide
similar benefits to traditional PMD-sized THP: Page faults are
significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
the configured order), but latency spikes are much less prominent
because the size of each page isn't as huge as the PMD-sized variant and
there is less memory to clear in each page fault. The number of per-page
operations (e.g. ref counting, rmap management, lru list management) is
also significantly reduced since those ops now become per-folio.

Some architectures also employ TLB compression mechanisms to squeeze
more entries in when a set of PTEs are virtually and physically
contiguous and appropriately aligned. In this case, TLB misses will
occur less often.

The new behaviour is disabled by default, but can be enabled at runtime
by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
(see documentation in previous commit). The long term aim is to change
the default to include suitable lower orders, but there are some risks
around internal fragmentation that need to be better understood first.

Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
---
include/linux/huge_mm.h | 6 ++-
mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
2 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd0eadd3befb..91a53b9835a4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

/*
- * Mask of all large folio orders supported for anonymous THP.
+ * Mask of all large folio orders supported for anonymous THP; all orders up to
+ * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
+ * (which is a limitation of the THP implementation).
*/
-#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
+#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))

/*
* Mask of all large folio orders supported for file THP.
diff --git a/mm/memory.c b/mm/memory.c
index 3ceeb0f45bf5..bf7e93813018 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}

+static bool pte_range_none(pte_t *pte, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(pte + i)))
+			return false;
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct folio *folio;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long orders;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (userfaultfd_armed(vma))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+					  BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	order = first_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		vmf->pte = pte + pte_index(addr);
+		if (pte_range_none(vmf->pte, 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << order);

Minor.

Do we have to constantly clear a huge page? Is it possible to let
post_alloc_hook()
finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
vma_alloc_zeroed_movable_folio() is doing?

I'm currently following the same allocation pattern as is done for PMD-sized
THP. In earlier versions of this patch I was trying to be smarter and use the
__GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
and follow the existing pattern.

Yes, this should be optimized on top IMHO.

--
Cheers,

David / dhildenb
