Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios

From: John Hubbard
Date: Fri Oct 27 2023 - 19:04:43 EST


On 9/29/23 04:44, Ryan Roberts wrote:

Hi Ryan,

A few clarifying questions below.

...
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2e7c338229a6..c4860476a1f5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
/*
- * Mask of all large folio orders supported for anonymous THP.
+ * Mask of all large folio orders supported for anonymous THP; all orders up to
+ * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
+ * (which is a limitation of the THP implementation).
*/
-#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
+#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
/*
* Mask of all large folio orders supported for file THP.
diff --git a/mm/memory.c b/mm/memory.c
index b5b82fc8e164..92ed9c782dc9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+ int i;
+
+ if (nr_pages == 1)
+ return vmf_pte_changed(vmf);
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+ return true;

This seems like something different than the function name implies.
It's really confusing: for a single page case, return true if the
pte in the page tables has changed, yes that is very clear.

But then for multiple page cases, which is really the main
focus here--for that, claim that the range has changed if any
pte is present (!pte_none). Can you please help me understand
what this means?

+ }
+
+ return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+ gfp_t gfp;
+ pte_t *pte;
+ unsigned long addr;
+ struct folio *folio;
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned int orders;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (userfaultfd_armed(vma))
+ goto fallback;
+
+ /*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+ orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
+ BIT(PMD_ORDER) - 1);
+ orders = transhuge_vma_suitable(vma, vmf->address, orders);
+
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (!pte)
+ return ERR_PTR(-EAGAIN);

pte_offset_map() can only fail due to:

a) Wrong pmd type. These include:
pmd_none
pmd_bad
pmd migration entry
pmd_trans_huge
pmd_devmap

b) __pte_map() failure

For (a), why is it that -EAGAIN is used here? I see that that
will lead to a re-fault, I got that far, but am missing something
still.

For (b), same question, actually. I'm not completely sure why
why a retry is going to fix a __pte_map() failure?


+
+ order = first_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ vmf->pte = pte + pte_index(addr);
+ if (!vmf_pte_range_changed(vmf, 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ vmf->pte = NULL;
+ pte_unmap(pte);
+
+ gfp = vma_thp_gfp_mask(vma);
+
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio) {
+ clear_huge_page(&folio->page, addr, 1 << order);
+ return folio;
+ }
+ order = next_order(&orders, order);
+ }

And finally: is it accurate to say that there are *no* special
page flags being set, for PTE-mapped THPs? I don't see any here,
but want to confirm.


thanks,
--
John Hubbard
NVIDIA