Re: [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
From: Kirill A. Shutemov
Date: Fri Jun 04 2021 - 12:24:05 EST
On Tue, Jun 01, 2021 at 02:13:21PM -0700, Hugh Dickins wrote:
> Running certain tests with a DEBUG_VM kernel would crash within hours,
> on the total_mapcount BUG() in split_huge_page_to_list(), while trying
> to free up some memory by punching a hole in a shmem huge page: split's
> try_to_unmap() was unable to find all the mappings of the page (which,
> on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
>
> Crash dumps showed two tail pages of a shmem huge page remained mapped
> by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
> of a long unmapped range; and no page table had yet been allocated for
> the head of the huge page to be mapped into.
>
> Although designed to handle these odd misaligned huge-page-mapped-by-pte
> cases, page_vma_mapped_walk() falls short by returning false prematurely
> when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
> are cases when a huge page may span a page table boundary, with ptes
> present in the next page table.
Oh. My bad. I guess it was a pain to debug.
> Restructure page_vma_mapped_walk() as a loop to continue in these cases,
> while keeping its layout much as before. Add a step_forward() helper to
> advance pvmw->address across those boundaries: originally I tried to use
> mm's standard p?d_addr_end() macros, but hit the same crash 512 times
> less often: redundant levels are folded together, but folded differently
> in different configurations, which made those macros just too difficult
> to use correctly; and step_forward() is simpler anyway.
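For readers without the full patch in front of them, the helper would be
something along these lines (my sketch, not copied from the patch, so
details may differ):

  static void step_forward(struct page_vma_mapped_walk *pvmw,
                           unsigned long size)
  {
          /* Round pvmw->address up to the next size-aligned boundary */
          pvmw->address = (pvmw->address + size) & ~(size - 1);
          /* If that wrapped past the top of the address space, clamp */
          if (!pvmw->address)
                  pvmw->address = ULONG_MAX;
  }

With that, each not-present branch can step_forward() by PGDIR_SIZE,
P4D_SIZE, PUD_SIZE or PMD_SIZE as appropriate and continue the loop,
instead of returning false.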
>
> Merged various other minor fixes and cleanups into page_vma_mapped_walk()
> as I worked on it: which I find much easier to enumerate here than to
> prise apart into separate commits.
But it makes the patch harder to review...
> Handle the hugetlbfs PageHuge() case in full, once at the start,
> so we don't need to worry about it again further down.
>
> Sometimes a local copy of pvmw->page was used, sometimes pvmw->page
> itself: just use pvmw->page throughout (and continue to use pvmw->address
> throughout, though we could take a local copy instead).
>
> Use pmd_read_atomic() with barrier() instead of READ_ONCE() for pmde:
> some architectures (e.g. i386 with PAE) have a multi-word pmd entry,
> for which READ_ONCE() is not good enough.
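The pattern being described is roughly this (sketch; `pud` comes from
the walk just above):

  pvmw->pmd = pmd_offset(pud, pvmw->address);
  /*
   * pmd_read_atomic() copes with a multi-word pmd entry (e.g. i386
   * PAE), which a plain READ_ONCE() could read torn; barrier() then
   * stops the compiler caching or re-reading *pvmw->pmd, so the
   * checks below all test the one snapshot held in pmde.
   */
  pmde = pmd_read_atomic(pvmw->pmd);
  barrier();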
>
> Re-evaluate pmde after taking lock, then use it in subsequent tests,
> instead of repeatedly dereferencing pvmw->pmd pointer.
>
> Rearrange the !pmd_present block to follow the same "return not_found,
> return not_found, return true" pattern as the block above it (note:
> returning not_found there is never premature, since the existence or
> prior existence of a huge pmd guarantees good alignment).
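i.e. the !pmd_present block ends up mirroring the shape of the
pmd_trans_huge block above it, something like this (sketch from my
reading of the patch; helper names may not match exactly):

  if (likely(pmd_trans_huge(pmde))) {
          if (pvmw->flags & PVMW_MIGRATION)
                  return not_found(pvmw);
          if (pmd_page(pmde) != pvmw->page)
                  return not_found(pvmw);
          return true;
  }
  if (!pmd_present(pmde)) {
          swp_entry_t entry;

          if (!thp_migration_supported() ||
              !(pvmw->flags & PVMW_MIGRATION))
                  return not_found(pvmw);
          entry = pmd_to_swp_entry(pmde);
          if (!is_migration_entry(entry) ||
              migration_entry_to_page(entry) != pvmw->page)
                  return not_found(pvmw);
          return true;
  }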
>
> Adjust page table boundary test in case address was not page-aligned.
>
> Reset pvmw->pte to NULL after unmapping that page table.
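Taken together, I read those last two points as something like this
(sketch; assumes the restructured loop's restart label):

  /* Did we cross page table boundary? */
  if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
          if (pvmw->ptl) {
                  spin_unlock(pvmw->ptl);
                  pvmw->ptl = NULL;
          }
          pte_unmap(pvmw->pte);
          pvmw->pte = NULL;       /* reset stale pointer after unmap */
          goto restart;
  }
  pvmw->pte++;

Masking with PMD_SIZE - PAGE_SIZE rather than testing address % PMD_SIZE
ignores any sub-page offset in the address, so the boundary is still
detected when the address is not page-aligned.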
>
> Respect the 80-column line limit.
>
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
I tried to review it, and superficially it looks good, but it has to be
split into a bunch of patches.
> 	/* when pud is not present, pte will be NULL */
> -	pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> +	pvmw->pte = huge_pte_offset(mm, pvmw->address,
> +				    page_size(pvmw->page));
AFAICS, it fits exactly within 80 columns.
> 	if (!pvmw->pte)
> 		return false;
>
> -	pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
> +	pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
> +				     mm, pvmw->pte);
And this one ends at column 79.
Hm?
--
Kirill A. Shutemov