Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking

From: Baolin Wang
Date: Thu Jul 25 2024 - 22:33:30 EST




On 2024/7/26 02:39, David Hildenbrand wrote:
We recently made GUP's common page table walking code to also walk
hugetlb VMAs without most hugetlb special-casing, preparing for the
future of having less hugetlb-specific page table walking code in the
codebase. Turns out that we missed one page table locking detail: page
table locking for hugetlb folios that are not mapped using a single
PMD/PUD.

Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
page tables, will perform a pte_offset_map_lock() to grab the PTE table
lock.

However, hugetlb that concurrently modifies these page tables would
actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
locks would differ. Something similar can happen right now with hugetlb
folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.

Let's make huge_pte_lockptr() effectively uses the same PT locks as any
core-mm page table walker would.

Thanks for raising the issue again. I remember fixing this issue 2 years ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.

There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
folio being mapped using two PTE page tables. While hugetlb wants to take
the PMD table lock, core-mm would grab the PTE table lock of one of both
PTE page tables. In such corner cases, we have to make sure that both
locks match, which is (fortunately!) currently guaranteed for 8xx as it
does not support SMP.

Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a01..da800e56fe590 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
struct mm_struct *mm, pte_t *pte)
{
- if (huge_page_size(h) == PMD_SIZE)
+ VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
+ VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
+
+ /*
+ * hugetlb must use the exact same PT locks as core-mm page table
+ * walkers would. When modifying a PTE table, hugetlb must take the
+ * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+ * PT lock etc.
+ *
+ * The expectation is that any hugetlb folio smaller than a PMD is
+ * always mapped into a single PTE table and that any hugetlb folio
+ * smaller than a PUD (but at least as big as a PMD) is always mapped
+ * into a single PMD table.

ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4 KiB base page size. This means the PT locks for 32MiB hugetlb may race again, as we currently only hold one PMD lock for several PMD entries of a cont-PMD size hugetlb.

+ *
+ * If that does not hold for an architecture, then that architecture
+ * must disable split PT locks such that all *_lockptr() functions
+ * will give us the same result: the per-MM PT lock.
+ */
+ if (huge_page_size(h) < PMD_SIZE)
+ return pte_lockptr(mm, pte);
+ else if (huge_page_size(h) < PUD_SIZE)
return pmd_lockptr(mm, (pmd_t *) pte);

IIUC, as I said above, this change doesn't fix the inconsistent lock for cont-PMD size hugetlb for GUP, and it will also break the lock rule for unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock before for cont-PMD size hugetlb before).

- VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
- return &mm->page_table_lock;
+ return pud_lockptr(mm, (pud_t *) pte);
}
#ifndef hugepages_supported