Re: [PATCH 4/5] mm/page_vma_mapped: use huge_ptep_get() for hugetlb
From: Dev Jain
Date: Fri Jun 26 2026 - 11:27:20 EST
On 26/06/26 7:40 pm, Lance Yang wrote:
>
> On Fri, Jun 26, 2026 at 06:53:10PM +0530, Dev Jain wrote:
>>
>>
>> On 26/06/26 1:18 pm, Lance Yang wrote:
>>>
>>> On Thu, Jun 25, 2026 at 11:29:53AM +0000, Dev Jain wrote:
>>>> check_pte() is the final validation step in page_vma_mapped_walk().
>>>> It reads pvmw->pte with ptep_get() to decide whether the entry maps
>>>> the PFN range being walked. For hugetlb VMAs, that pointer refers
>>>> to a hugetlb entry.
>>>>
>>>> On arches which provide their own huge_ptep_get() to dereference a huge
>>>> pte pointer, accessing via ptep_get() would cause pte_pfn(),
>>>> pte_present() etc to misbehave.
>>>>
>>>> It is not clear whether this has a trivially visible effect to userspace.
>>>>
>>>> Use huge_ptep_get() to dereference a huge pte pointer.
>>>>
>>>> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
>>>> ---
>>>> mm/page_vma_mapped.c | 8 +++++++-
>>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index 2ccbabfb2cc17..18e1d341f463c 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -107,7 +107,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>>> static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>> {
>>>> unsigned long pfn;
>>>> - pte_t ptent = ptep_get(pvmw->pte);
>>>> + pte_t ptent;
>>>> +
>>>> + if (is_vm_hugetlb_page(pvmw->vma))
>>>> + ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>>> + pvmw->pte);
>>>
>>> I think check_pte() can pass a wrong address to huge_ptep_get() ...
>>
>> Won't this be handled by rmap_walk_anon/rmap_walk_file - they are the ones
>> performing the rmap traversal and passing address to try_to_unmap_one/folio_referenced_one
>> etc ...
>
> Right, that should cover the rmap callbacks. The bit I was worried about
> is page_mapped_in_vma() though.
>
>>>
>>> Not sure that is wrong in the first place. For memory failure,
>>> page_mapped_in_vma() can be called with a poisoned tail page of a hugetlb
>>> folio. In that case, pvmw->address need not be hugepage-aligned.
>>>
>>> @Miaohe
>
> For hugetlb memory failure we start with the poisoned PFN:
>
> static int try_memory_failure_hugetlb(unsigned long pfn, int flags)
> {
> ...
> struct page *p = pfn_to_page(pfn);
> struct folio *folio;
> ...
>
> folio = page_folio(p);
>
> ...
>
> if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
> ...
> }
>
> ...
> }
>
> and pass the same p down:
>
> static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> unsigned long pfn, int flags)
> {
> ...
>
> collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>
> ...
> }
>
> static void collect_procs(const struct folio *folio, const struct page *page,
> struct list_head *tokill, int force_early)
> {
> ...
>
> if (unlikely(folio_test_ksm(folio)))
> ...
> else if (folio_test_anon(folio))
> collect_procs_anon(folio, page, tokill, force_early);
> else
> ...
> }
>
> So collect_procs_anon() still gets the poisoned page, not &folio->page:
>
> static void collect_procs_anon(const struct folio *folio,
> const struct page *page, struct list_head *to_kill,
> int force_early)
> {
> ...
>
> pgoff = page_pgoff(folio, page);
> rcu_read_lock();
> for_each_process(tsk) {
> ...
>
> anon_vma_interval_tree_foreach(vmac, &av->rb_root,
> pgoff, pgoff) {
> ...
> addr = page_mapped_in_vma(page, vma);
> ...
> }
> }
> rcu_read_unlock();
> anon_vma_unlock_read(av);
> }
>
> page_mapped_in_vma() then builds pvmw for that page:
>
> unsigned long page_mapped_in_vma(const struct page *page,
> struct vm_area_struct *vma)
> {
> const struct folio *folio = page_folio(page);
> struct page_vma_mapped_walk pvmw = {
> .pfn = page_to_pfn(page),
> .nr_pages = 1,
> .vma = vma,
> .flags = PVMW_SYNC,
> };
>
> pvmw.address = vma_address(vma, page_pgoff(folio, page), 1);
> ...
> }
>
> and page_pgoff() includes the subpage index:
>
> static inline pgoff_t page_pgoff(const struct folio *folio,
> const struct page *page)
> {
> return folio->index + folio_page_idx(folio, page);
> }
>
> So if the poisoned PFN points to a tail page, pvmw->address can be offset
> from the start of the hugetlb mapping by
>
> folio_page_idx(folio, page) << PAGE_SHIFT
>
> Should check_pte() pass the hugepage-aligned address to huge_ptep_get()
> for that case?
Thanks! This looks correct.
I can indeed fix this up in check_pte. But in the memory-failure path
it has always been confusing to me for hugetlb folios why we are bothering
with the tail page. I am sure that area can also be simplified. But for
now I'll just do a simple fix here itself.
>
> Cheers, Lance
>
>>>
>>> For arm64, CONT_PMD_SIZE is one supported HugeTLB size. With such a VMA,
>>> page_vma_mapped_walk() passes that size to hugetlb_walk():
>>>
>>> bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>> {
>>> ...
>>> if (unlikely(is_vm_hugetlb_page(vma))) {
>>> ...
>>> pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>>> ...
>>> }
>>> ...
>>> }
>>>
>>> hugetlb_walk() then calls arm64 huge_pte_offset(mm, addr, sz). For
>>> sz == CONT_PMD_SIZE, huge_pte_offset() aligns its local addr before
>>> calculating pmdp:
>>>
>>> pte_t *huge_pte_offset(struct mm_struct *mm,
>>> unsigned long addr, unsigned long sz)
>>> {
>>> ...
>>> if (sz == CONT_PMD_SIZE)
>>> addr &= CONT_PMD_MASK;
>>>
>>> pmdp = pmd_offset(pudp, addr);
>>> pmd = READ_ONCE(*pmdp);
>>> ...
>>> }
>>>
>>> So for that case, pvmw->pte is calculated from the aligned addr, not
>>> necessarily from the original pvmw->address. But check_pte() passes the
>>> original address together with pvmw->pte:
>>>
>>> + ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>> + pvmw->pte);
>>>
>>> arm64 then uses that addr again to choose ncontig:
>>>
>>> pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>> {
>>> ...
>>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>> for (i = 0; i < ncontig; i++, ptep++) {
>>> ...
>>> }
>>> return orig_pte;
>>> }
>>>
>>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, size_t *pgsize)
>>> {
>>> pgd_t *pgdp = pgd_offset(mm, addr);
>>> p4d_t *p4dp;
>>> pud_t *pudp;
>>> pmd_t *pmdp;
>>>
>>> *pgsize = PAGE_SIZE;
>>> p4dp = p4d_offset(pgdp, addr);
>>> pudp = pud_offset(p4dp, addr);
>>> pmdp = pmd_offset(pudp, addr);
>>> if ((pte_t *)pmdp == ptep) {
>>> *pgsize = PMD_SIZE;
>>> return CONT_PMDS;
>>> }
>>> return CONT_PTES;
>>> }
>>>
>>> With a tail address, pmdp may no longer point at pvmw->pte, so
>>> find_num_contig() can return CONT_PTES for a CONT_PMD HugeTLB mapping.
>>>
>>> On 16K arm64, that changes ncontig from 32 to 128. So huge_ptep_get()
>>> can walk past the CONT_PMD entries, and possibly past the PMD table.
>>>
>>> Should check_pte() pass the address matching pvmw->pte, sth like:
>>>
>>> ---8<---
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index 406fd50bbd8f..58463493bd3d 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -109,11 +109,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>> unsigned long pfn;
>>> pte_t ptent;
>>>
>>> - if (is_vm_hugetlb_page(pvmw->vma))
>>> - ptent = huge_ptep_get(pvmw->vma->vm_mm, pvmw->address,
>>> - pvmw->pte);
>>> - else
>>> + if (is_vm_hugetlb_page(pvmw->vma)) {
>>> + struct hstate *hstate = hstate_vma(pvmw->vma);
>>> + unsigned long haddr = pvmw->address & huge_page_mask(hstate);
>>> +
>>> + ptent = huge_ptep_get(pvmw->vma->vm_mm, haddr, pvmw->pte);
>>> + } else {
>>> ptent = ptep_get(pvmw->pte);
>>> + }
>>>
>>> if (pvmw->flags & PVMW_MIGRATION) {
>>> const softleaf_t entry = softleaf_from_pte(ptent);
>>> --
>>>
>>> while leaving pvmw->address unchanged for page_mapped_in_vma()?
>>>
>>> Cheers, Lance
>>>
>>>> + else
>>>> + ptent = ptep_get(pvmw->pte);
>>>>
>>>> if (pvmw->flags & PVMW_MIGRATION) {
>>>> const softleaf_t entry = softleaf_from_pte(ptent);
>>>> --
>>>> 2.43.0
>>>>
>>>>
>>
>>
>