Re: [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware

From: Matthew Wilcox
Date: Tue Jan 21 2025 - 00:00:32 EST


On Mon, Jan 20, 2025 at 09:18:49PM -0700, Jane Chu wrote:
> When a process consumes a UE in a page, the memory failure handler
> attempts to collect information for a potential SIGBUS.
> If the page is an anonymous page, page_mapped_in_vma(page, vma) is
> invoked in order to
> 1. retrieve the vaddr from the process' address space,
> 2. verify that the vaddr is indeed mapped to the poisoned page,
> where 'page' is the precise small page with UE.
>
> It's been observed that when injecting poison to a non-head subpage
> of an anonymous hugetlb page, no SIGBUS show up; while injecting to
> the head page produces a SIGBUS. The casue is that, though hugetlb_walk()
> returns a valid pmd entry (on x86), but check_pte() detects mismatch
> between the head page per the pmd and the input subpage. Thus the vaddr
> is considered not mapped to the subpage and the process is not collected
> for SIGBUS purpose. This is the calling stack
> collect_procs_anon
> page_mapped_in_vma
> page_vma_mapped_walk
> hugetlb_walk
> huge_pte_lock
> check_pte
>
> It seems that the most obvious place to fix the issue is by making
> page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
> input is useful in providing PAGE_SIZE granularity vaddr.

I don't like this solution because it adds yet another special case for
hugetlb. If we don't split a PMD-mapped THP, we'd have the same
problem, right?

check_pte() would succeed if we set pvmw->pfn to folio_pfn() and
pvmw->nr_pages to folio_nr_pages(), right? I just don't know what else
might be affected by that.

I like one of these two options:

@@ -206,6 +206,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte)
return false;
+ pvmw->pte += pvmw->address & (size - PAGE_SIZE);

pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))

(that needs a bit of tidying up; you can't just do that, but I think
you get the basic idea -- correct the pte to point to the precise page
instead of the hugetlb pfn)


The option I really prefer is much more work but matches our preferred
direction of getting rid of hugetlb specific code. Something like this:

@@ -192,27 +192,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pmd && !pvmw->pte)
return not_found(pvmw);

- if (unlikely(is_vm_hugetlb_page(vma))) {
- struct hstate *hstate = hstate_vma(vma);
- unsigned long size = huge_page_size(hstate);
- /* The only possible mapping was handled on last iteration */
[...]
- pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
- if (!check_pte(pvmw))
- return not_found(pvmw);
- return true;
- }
-
end = vma_address_end(pvmw);
if (pvmw->pte)
goto next_pte;
@@ -229,7 +208,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw continue;
}
pud = pud_offset(p4d, pvmw->address);
- if (!pud_present(*pud)) {
+ pude = *pud;
+ if (pud_trans_huge(pude) ||
+ (pud_present(pude) && pud_devmap(pude))) {
+ pvmw->ptl = pud_lock(mm, pvmw->pud);
+ ...
+ if (likely(pud_trans_huge(pude) || pud_devmap(pude))) {
+ if (pvmw->flags & PVMW_MIGRATION)
+ return not_found(pvmw);
+ if (!check_pud(pud_pfn(pude), pvmw))
+ return not_found(pvmw);
+ return true;
+ }
+ } else if (!pud_present(pude)) {
step_forward(pvmw, PUD_SIZE);
continue;
}

ie get rid of all the hugetlb-specific code, and add support for the
PUD level to the common code. You'd also need to write check_pud().

I'll understand if you don't want to do all the extra work. And
thanks for tracking down this bug.