Re: [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb
From: David Hildenbrand (Arm)
Date: Mon Jun 29 2026 - 04:11:51 EST
On 6/29/26 09:48, Lance Yang wrote:
>
> On Mon, Jun 29, 2026 at 09:25:48AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/29/26 08:48, Dev Jain wrote:
>>>
>>>
>>>
>>> Sashiko notes other places:
>>>
>>> https://sashiko.dev/#/patchset/20260625112955.3254283-1-dev.jain%40arm.com
>>
>> Yeah, that looks shaky. We do seem to have a bunch of these cases, primarily
>> from pagewalk code (where some users like pagemap need the actual address).
>
> Indeed ...
>
>> I think we have two options
>>
>> 1) To prevent any (further) issues, make huge_ptep_get() always consume the
>> hstate, and let the arch code deal with aligning it. Invasive.
>
> Kinda lean toward option 1, even if it's more invasive. If we pass the
> hstate down, each arch can figure out the right addr from there.
>
>> 2) Make the arch code handle aligning without the hstate.
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 30772a909aea3..303a1b74796c9 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -126,6 +126,9 @@ pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> return orig_pte;
>>
>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>> + ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * ncontig);
>> + orig_pte = __ptep_get(ptep);
>> +
>> for (i = 0; i < ncontig; i++, ptep++) {
>> pte_t pte = __ptep_get(ptep);
>>
>> (nshift/order instead of ncontig might avoid a multiplication, but not sure if that matters in practice)
>>
>> IIUC, that's similar to what huge_ptep_get() does on ppc.
>>
>>
>> static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> {
>> if (ptep_is_8m_pmdp(mm, addr, ptep))
>> ptep = pte_offset_kernel((pmd_t *)ptep, ALIGN_DOWN(addr, SZ_8M));
>> return ptep_get(ptep);
>> }
>>
>> I'd assume we could do the same on riscv. Besides that, I don't think any arch has cont
>> entries.
>
> AFAICT, for huge_ptep_get() the addr users are arm64 and powerpc, riscv
> doesn't really care about addr there. Looks mostly arm64-specific ...

powerpc handles it correctly in the weird "span two PMD entries" case by
aligning the PMD down.

RISC-V copied from arm64, but can simply derive the number of entries from the
PTE value; it doesn't have to re-walk the table using the address.

But I think the following is required to fix it, no?

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index a6d217112cf46..7e25cc13b3dba 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -5,6 +5,7 @@
 #ifdef CONFIG_RISCV_ISA_SVNAPOT
 pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	unsigned long pte_num;
+	unsigned long pte_num, pte_order;
 	int i;
 	pte_t orig_pte = ptep_get(ptep);
@@ -12,7 +13,11 @@ pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 	if (!pte_present(orig_pte) || !pte_napot(orig_pte))
 		return orig_pte;

-	pte_num = napot_pte_num(napot_cont_order(orig_pte));
+	pte_order = napot_cont_order(orig_pte);
+	pte_num = napot_pte_num(pte_order);
+
+	ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) << pte_order);
+	orig_pte = ptep_get(ptep);

 	for (i = 0; i < pte_num; i++, ptep++) {
 		pte_t pte = ptep_get(ptep);
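
To illustrate why the align-down matters, here is a minimal userspace sketch, not
kernel code. The toy pte_t layout, the PTE_* bit values, and the helpers
pte_align_down() and toy_huge_ptep_get() are all made up for illustration. It
assumes the table is aligned to at least the size of one contiguous block, and
it only models the "start at the first entry of the block, then OR in
dirty/young" part of the diffs above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy PTE: hypothetical layout for illustration, not a real arch format. */
typedef struct { uint64_t val; } pte_t;

#define PTE_PRESENT	(1ULL << 0)
#define PTE_DIRTY	(1ULL << 1)
#define PTE_YOUNG	(1ULL << 2)

/* Userspace stand-in for PTR_ALIGN_DOWN(); bytes must be a power of two. */
static pte_t *pte_align_down(pte_t *ptep, size_t bytes)
{
	assert(bytes && (bytes & (bytes - 1)) == 0);
	return (pte_t *)((uintptr_t)ptep & ~((uintptr_t)bytes - 1));
}

/*
 * Read a contiguous block of (1 << order) entries. ptep may point at any
 * entry of the block; we first move it back to the block's first entry,
 * so the loop covers the whole block instead of running past its end.
 */
static pte_t toy_huge_ptep_get(pte_t *ptep, unsigned int order)
{
	size_t i, nr = (size_t)1 << order;
	pte_t orig;

	ptep = pte_align_down(ptep, sizeof(*ptep) << order);
	orig = *ptep;

	/* Collect dirty/young from every entry in the block. */
	for (i = 0; i < nr; i++, ptep++)
		orig.val |= ptep->val & (PTE_DIRTY | PTE_YOUNG);

	return orig;
}
```

Without the align-down, a caller passing a pointer to the third entry of a
four-entry block would scan two entries of the next block and miss the first
two entries of its own.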

I'd prefer (2) as a simple stable fix first.

If we do (1) on top, huge_ptep_get() on arm64 could avoid walking the page table
a second time.

If we pass the hstate (or vma) to set_huge_pte_at(), huge_pte_clear(), and
huge_ptep_get_and_clear(), we could likely get rid of the re-walk in
num_contig_ptes() entirely and possibly just remove it.

That would probably be cleanest.

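
As a rough sketch of what (1) buys us: once the hugetlb size is known, the
number of entries and the per-entry size follow from a simple lookup, with no
page table walk. The code below is a userspace illustration only. The TOY_*
constants and toy_contig_entries() are hypothetical names, and the values
assume an arm64-like configuration with 4K base pages (16 contiguous entries
at both the PTE and PMD level).

```c
#include <stddef.h>

/* Assumed sizes for a 4K-granule, arm64-like layout (illustration only). */
#define TOY_PAGE_SIZE		(4UL << 10)
#define TOY_PMD_SIZE		(2UL << 20)
#define TOY_PUD_SIZE		(1UL << 30)
#define TOY_CONT_PTES		16
#define TOY_CONT_PMDS		16
#define TOY_CONT_PTE_SIZE	(TOY_CONT_PTES * TOY_PAGE_SIZE)
#define TOY_CONT_PMD_SIZE	(TOY_CONT_PMDS * TOY_PMD_SIZE)

/*
 * Map a hugetlb size to the number of page table entries backing it and
 * the size covered by each entry. Returns 0 for a size we do not know.
 */
static int toy_contig_entries(unsigned long size, size_t *pgsize)
{
	switch (size) {
	case TOY_CONT_PTE_SIZE:
		*pgsize = TOY_PAGE_SIZE;
		return TOY_CONT_PTES;
	case TOY_CONT_PMD_SIZE:
		*pgsize = TOY_PMD_SIZE;
		return TOY_CONT_PMDS;
	case TOY_PMD_SIZE:
	case TOY_PUD_SIZE:
		/* Single block entry, nothing contiguous to iterate over. */
		*pgsize = size;
		return 1;
	default:
		return 0;
	}
}
```

With the size in hand, the pointer alignment is then just
sizeof(pte_t) times the returned entry count.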
--
Cheers,
David