Re: [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes

From: Ryan Roberts
Date: Thu Feb 06 2025 - 07:55:58 EST


On 06/02/2025 06:15, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>> CONT_PMD_SIZE). So the function has to figure out the size from the
>> huge_pte pointer. This was previously done by walking the pgtable to
>> determine the level, then using the PTE_CONT bit to determine the number
>> of ptes.
>
> Actually, PTE_CONT is used to determine whether the entry is a normal
> (i.e. PMD/PUD based) huge page or a cont PTE/PMD based huge page, in
> order to call either the standard __ptep_get_and_clear() or the
> specific get_clear_contig(), after find_num_contig() has walked the
> page table.
>
> PTE_CONT presence is only used to make that choice, not to determine
> the number of ptes for the mapping as mentioned earlier.

Sorry, I don't really follow your distinction; PTE_CONT is used to decide whether
we are operating on a single entry (pte_cont()==false) or on multiple entries
(pte_cont()==true). For the multiple entry case, the level tells you the exact
number.

I can certainly tidy up this description a bit, but I think we both agree that
the value of PTE_CONT is one of the inputs into deciding how many entries need
to be operated on?

>
> There are two similar functions which determine this:
>
> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>                            pte_t *ptep, size_t *pgsize)
> {
>         pgd_t *pgdp = pgd_offset(mm, addr);
>         p4d_t *p4dp;
>         pud_t *pudp;
>         pmd_t *pmdp;
>
>         *pgsize = PAGE_SIZE;
>         p4dp = p4d_offset(pgdp, addr);
>         pudp = pud_offset(p4dp, addr);
>         pmdp = pmd_offset(pudp, addr);
>         if ((pte_t *)pmdp == ptep) {
>                 *pgsize = PMD_SIZE;
>                 return CONT_PMDS;
>         }
>         return CONT_PTES;
> }
>
> find_num_contig() already assumes that the entry is a contig huge page and
> just finds whether it is a PMD or PTE based one. This always requires a
> prior determination, via pte_cont(), that the PTE_CONT bit is set before
> calling find_num_contig() in each instance.

Agreed.

>
> But num_contig_ptes() can get the same information without walking the
> page table and thus without first determining whether PTE_CONT is set.
> The size can be derived from the VMA argument when present.

Also agreed. But the VMA is not provided to this function. And because we want
to use it for kernel space mappings, I think it's a bad idea to pass the VMA.

>
> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> {
>         int contig_ptes = 0;
>
>         *pgsize = size;
>
>         switch (size) {
> #ifndef __PAGETABLE_PMD_FOLDED
>         case PUD_SIZE:
>                 if (pud_sect_supported())
>                         contig_ptes = 1;
>                 break;
> #endif
>         case PMD_SIZE:
>                 contig_ptes = 1;
>                 break;
>         case CONT_PMD_SIZE:
>                 *pgsize = PMD_SIZE;
>                 contig_ptes = CONT_PMDS;
>                 break;
>         case CONT_PTE_SIZE:
>                 *pgsize = PAGE_SIZE;
>                 contig_ptes = CONT_PTES;
>                 break;
>         }
>
>         return contig_ptes;
> }
>
> On a side note, why can't num_contig_ptes() be used all the time and
> find_num_contig() be dropped? Or am I missing something here.

There are 2 remaining users of find_num_contig() after my series:
huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
only be legitimately called for present ptes (so it's safe to check pte_cont()).
huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA, but I guess
you could do the trick where you take the size of the folio that the pte points to?

So yes, I think we could drop find_num_contig() and I agree it would be an
improvement.

But to be honest, grabbing the folio size also feels like a hack to me (we do
this in other places too). While today the folio size is guaranteed in practice
to be the same as the huge pte size, I'm not sure there is any spec that
mandates that?

Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte so
we can tell the size of both present and non-present ptes, then do the table
walk trick to find the level. Shrug.

>
>>
>> But the PTE_CONT bit is only valid when the pte is present. For
>> non-present pte values (e.g. markers, migration entries), the previous
>> implementation was therefore erroneously determining the size. There is
>> at least one known caller in core-mm, move_huge_pte(), which may call
>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>> this case. Additionally, the "regular" ptep_get_and_clear() is robust to
>> being called for non-present ptes, so it makes sense to follow that
>> behaviour.
>
> With the VMA argument and num_contig_ptes(), the dependency on PTE_CONT
> being set and the entry being mapped might not be required.
>
>>
>> Fix this by using the new sz parameter which is now provided to the
>> function. Additionally when clearing each pte in a contig range, don't
>> gather the access and dirty bits if the pte is not present.
>
> Makes sense.
>
>>
>> An alternative approach that would not require API changes would be to
>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>> cleaner to follow other APIs' lead and just pass in the size.
>
> Right, changing the arguments in the API will help solve this problem.
>
>>
>> While we are at it, add some debug warnings in functions that require
>> the pte is present.
>>
>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>> entry offset field (layout of non-present pte). Since hugetlb is never
>> swapped to disk, this field will only be populated for markers, which
>> always set this bit to 0, and hwpoison swap entries, which set the offset
>> field to a PFN; so it would only ever be 1 for a 52-bit PVA system where
>> memory in that high half was poisoned (I think!). So in practice, this
>> bit would almost always be zero for non-present ptes and we would only
>> clear the first entry if it was actually a contiguous block. That's
>> probably a less severe symptom than if it was always interpreted as 1
>> and cleared out potentially-present neighboring PTEs.
>>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>> arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>> 1 file changed, 29 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 06db4649af91..328eec4bfe55 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>                                unsigned long pgsize,
>>                                unsigned long ncontig)
>>  {
>> -        pte_t orig_pte = __ptep_get(ptep);
>> -        unsigned long i;
>> -
>> -        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>> -                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>> -
>> -                /*
>> -                 * If HW_AFDBM is enabled, then the HW could turn on
>> -                 * the dirty or accessed bit for any page in the set,
>> -                 * so check them all.
>> -                 */
>> -                if (pte_dirty(pte))
>> -                        orig_pte = pte_mkdirty(orig_pte);
>> -
>> -                if (pte_young(pte))
>> -                        orig_pte = pte_mkyoung(orig_pte);
>> +        pte_t pte, tmp_pte;
>> +        bool present;
>> +
>> +        pte = __ptep_get_and_clear(mm, addr, ptep);
>> +        present = pte_present(pte);
>> +        while (--ncontig) {
>
> Although this does the right thing by calling __ptep_get_and_clear() once
> for non-contig huge pages, I am wondering if the cont/non-cont separation
> should be maintained in the caller, huge_ptep_get_and_clear(), keeping the
> current logical bifurcation intact.

To what benefit?

>
>> +                ptep++;
>> +                addr += pgsize;
>> +                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>> +                if (present) {
>
> Checking for present entries makes sense here.
>
>> +                        if (pte_dirty(tmp_pte))
>> +                                pte = pte_mkdirty(pte);
>> +                        if (pte_young(tmp_pte))
>> +                                pte = pte_mkyoung(pte);
>> +                }
>>          }
>> -        return orig_pte;
>> +        return pte;
>>  }
>>
>>  static pte_t get_clear_contig_flush(struct mm_struct *mm,
>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>  {
>>          int ncontig;
>>          size_t pgsize;
>> -        pte_t orig_pte = __ptep_get(ptep);
>> -
>> -        if (!pte_cont(orig_pte))
>> -                return __ptep_get_and_clear(mm, addr, ptep);
>> -
>> -        ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> +        ncontig = num_contig_ptes(sz, &pgsize);
>
> __ptep_get_and_clear() can still be called here if 'ncontig' is
> returned as 0, indicating a normal non-contig huge page, thus
> keeping get_clear_contig() unchanged just to handle contig huge
> pages.

I think you're describing the case where num_contig_ptes() returns 0? The
intention, from my reading of the function, is that num_contig_ptes() returns
the number of ptes that need to be operated on (e.g. 1 for a single entry or N
for a contig block). It will only return 0 if called with an invalid huge size.
I don't believe it will ever "return 0 indicating a normal non-contig huge page".

Perhaps the right solution is to add a warning if returning 0?

>
>>          return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>  }
>>
>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>          pgprot_t hugeprot;
>>          pte_t orig_pte;
>>
>> +        VM_WARN_ON(!pte_present(pte));
>> +
>>          if (!pte_cont(pte))
>>                  return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>
>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>                  return 0;
>>
>>          orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>> +        VM_WARN_ON(!pte_present(orig_pte));
>>
>>          /* Make sure we don't lose the dirty or young state */
>>          if (pte_dirty(orig_pte))
>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>          size_t pgsize;
>>          pte_t pte;
>>
>> -        if (!pte_cont(__ptep_get(ptep))) {
>> +        pte = __ptep_get(ptep);
>> +        VM_WARN_ON(!pte_present(pte));
>> +
>> +        if (!pte_cont(pte)) {
>>                  __ptep_set_wrprotect(mm, addr, ptep);
>>                  return;
>>          }
>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>>          struct mm_struct *mm = vma->vm_mm;
>>          size_t pgsize;
>>          int ncontig;
>> +        pte_t pte;
>>
>> -        if (!pte_cont(__ptep_get(ptep)))
>> +        pte = __ptep_get(ptep);
>> +        VM_WARN_ON(!pte_present(pte));
>> +
>> +        if (!pte_cont(pte))
>>                  return ptep_clear_flush(vma, addr, ptep);
>>
>>          ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>
> In all the above instances, shouldn't num_contig_ptes() be called to determine
> whether a given entry is a non-contig or contig huge page, thus dropping the
> need for the pte_cont() and pte_present() tests proposed here?

Yeah maybe. But as per above, we have options for how to do that. I'm not sure
which is preferable at the moment. What do you think? Regardless, I think that
cleanup would be a separate patch (which I'm happy to add for v2). For this bug
fix, I was trying to do the minimum.

Thanks,
Ryan