Re: [syzbot] general protection fault in PageHeadHuge

From: Peter Xu
Date: Fri Sep 30 2022 - 23:01:51 EST


On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> Hi, Mike,
>
> On Fri, Sep 30, 2022 at 02:38:01PM -0700, Mike Kravetz wrote:
> > On 09/30/22 12:05, Peter Xu wrote:
> > > On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > > > I was able to do a little more debugging:
> > > >
> > > > As you know the hugetlb calling path to handle_userfault is something
> > > > like this,
> > > >
> > > > hugetlb_fault
> > > > mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > > > ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> > > > if (huge_pte_none_mostly())
> > > > hugetlb_no_page()
> > > > page = find_lock_page(mapping, idx);
> > > > if (!page) {
> > > > if (userfaultfd_missing(vma))
> > > > mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > > > return handle_userfault()
> > > > }
> > > >
> > > > For anon mappings, find_lock_page() will never find a page, so as long
> > > > as huge_pte_none_mostly() is true we will call into handle_userfault().
> > > >
> > > > Since your analysis shows the testcase should never call handle_userfault() for
> > > > a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> > > > the call to handle_userfault(). Sure enough, I saw plenty of printk messages.
> > > >
> > > > Then, before calling handle_userfault() I added code to take the page table
> > > > lock and test huge_pte_none_mostly() again. In every FAULT_FLAG_WRITE case,
> > > > this second test of huge_pte_none_mostly() was false. So, the condition
> > > > changed from the check in hugetlb_fault until the check (with page table
> > > > lock) in hugetlb_no_page.
> > > >
> > > > IIUC, the only code that should be modifying the pte in this test is
> > > > hugetlb_mcopy_atomic_pte(). It also holds the hugetlb_fault_mutex while
> > > > updating the pte.
> > > >
> > > > It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> > > > only guess that it is due to some caching issues.
> > > >
> > > > After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> > > >
> > > > /* No need to invalidate - it was non-present before */
> > > > update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > > >
> > > > I suspect that is true. However, it seems like this test depends on all
> > > > CPUs seeing the updated pte immediately?
> > > >
> > > > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > > > any difference. Suggestions would be appreciated as cache/tlb/??? flushing
> > > > issues take me a while to figure out.
> > >
> > > This morning when I went back and rethink the matter, I just found that the
> > > common hugetlb path handles private anonymous mappings with all empty page
> > > cache as you explained above. In that sense the two patches I posted may
> > > not really make sense even if they can pass the tests.. and maybe that's
> > > also the reason why the reservations got messed up. This is also something
> > > I found after I read more on the reservation code e.g. no matter private or
> > > shared hugetlb mappings we only reserve that only number of pages when mmap().
> > >
> > > Indeed if with that in mind the UFFDIO_COPY should also work because
> > > hugetlb fault handler checks pte first before page cache, so uffd missing
> > > should still work as expected.
> > >
> > > It makes sense especially for hugetlb to do that otherwise there can be
> > > plenty of zero huge pages cached in the page cache. I'm not sure whether
> > > this is the reason hugetlb does it differently (e.g. comparing to shmem?),
> > > it'll be great if I can get a confirmation. If it's true please ignore the
> > > two patches I posted.
> > >
> > > I think what you analyzed is correct in that the pte shouldn't go away
> > > after being armed once. One more thing I tried (actually yesterday) was
> > > SIGBUS the process when the write missing event was generated, and I can
> > > see the user stack points to the cmpxchg() of the pthread_mutex_lock(). It
> > > means indeed it moved forward and passed the mutex type check, it also
> > > means it should have seen a !none pte already with at least reading
> > > permission, in that sense it matches with "missing TLB" possibility
> > > experiment mentioned above, because for a missing TLB it should keep
> > > stucking at the read not write. It's still uncertain why the pte can go
> > > away somehow from under us and why it quickly re-appears according to your
> > > experiment.
> > >
> >
> > I 'think' it is more of a race with all cpus seeing the pte update. To be
> > honest, I can not wrap my head around how that can happen.
> >
> > I did not really like your idea of adding anon (or private) pages to the
> > page cache.
>
> I don't like that either, especially when I notice it breaks the
> reservations... :)
>
> Note that my previous patch wasn't adding anon or private pages into page
> cache. PageAnon() will be false for those pages being added. I was
> literally doing insertion of page cache for UFFDIO_COPY, rather than
> directly making it anonymous. Actually that's what shmem does.
>
> > As you discovered, there is code like reservations which depend
> > on current behavior.
> >
> > It seems to me that for 'missing' hugetlb faults there are two specific cases:
> > 1) Shared or file backed mappings. In this case, the page cache is the
> > 'source of truth'. If there is not a page in the page cache, then we
> > hand off to userfault with VM_UFFD_MISSING.
> > 2) anon or private mappings. In this case, pages are not in the page cache.
> > The page table is the 'source of truth'.
>
> Sorry, I can't say I fully agree with it.
>
> IMHO the missing mode is really about page cache. Let's imagine when we
> create private mappings upon a hugetlbfs file that has some pages written.
> When page fault triggers, we should never generate a missing if cache hit,
> even if the pte is null. Maybe that's not exactly what you meant, but the
> wording is kind of misleading here.
>
> I'd say it's really about hugetlbfs is special in treating private pages.
> Note that shmem wasn't treat it like that as I mentioned. But AFAICT
> hugetlbfs is different in a good way because otherwise we could be wasting
> quite a few useless zero pages dangling in the page cache, and AFAICT
> that's still what we do with shmem - just try to create 20M shmem
> anonymouse file, privately map and write to them and see how many mem we'll
> consume.. It's not optimal, but still correct IMHO.
>
> > Early in hugetlb fault processing
> > we check the page table (huge_pte_none_mostly). However, as my debug code
> > has shown, checking the page table again with lock held will reveal that
> > the pte has in fact been updated.
>
> Right, I found it in the hard way. I should have been reading code more
> carefully: it's caused by CoW where we'll need a clear+flush to make sure
> TLB consistency (in hugetlb_wp):
>
> if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> ClearHPageRestoreReserve(new_page);
>
> /* Break COW or unshare */
> huge_ptep_clear_flush(vma, haddr, ptep);
> mmu_notifier_invalidate_range(mm, range.start, range.end);
> page_remove_rmap(old_page, vma, true);
> hugepage_add_new_anon_rmap(new_page, vma, haddr);
> set_huge_pte_at(mm, haddr, ptep,
> make_huge_pte(vma, new_page, !unshare));
> SetHPageMigratable(new_page);
> /* Make the old page be freed below */
> new_page = old_page;
> }
>
> The early TLB flush was trying to avoid inconsistent TLB entries in
> different cores.
>
> So far I still don't know why the hugetlb commit changed the timing, but
> this time since I tracked the pgtables I'm more sure of its cause.
>
> >
> > My thought was that regular anon pages would have the same issue. So, I looked
> > at the calling code there. In do_anonymous_page() there is this block:
> >
> > /* Use the zero-page for reads */
> > if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > !mm_forbids_zeropage(vma->vm_mm)) {
> > entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > vma->vm_page_prot));
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > vmf->address, &vmf->ptl);
> > if (!pte_none(*vmf->pte)) {
> > update_mmu_tlb(vma, vmf->address, vmf->pte);
> > goto unlock;
> > }
> > ret = check_stable_address_space(vma->vm_mm);
> > if (ret)
> > goto unlock;
> > /* Deliver the page fault to userland, check inside PT lock */
> > if (userfaultfd_missing(vma)) {
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > return handle_userfault(vmf, VM_UFFD_MISSING);
> > }
> > goto setpte;
> > }
> >
> > Notice that here the pte is checked while holding the page table lock while
> > to make the decision to call handle_userfault().
>
> Right. That's a great finding, thanks. It's just that I found it great
> only after I found the whole story on the CoW in progress.. I could have
> been done better.
>
> >
> > In my testing, if we check huge_pte_none_mostly() while holding the page table
> > lock before calling handle_userfault we will not experience the failure. Can
> > you see if this also resolves the issue in your environment? I do not love
> > this solution as I still can not explain how this code is missing the pte
> > update.
>
> Yes this patch will fix it which I tested. But IMHO there're quite a few
> wordings that are misleading as I tried to explain above on why page cache
> still matters (and matters the most IMHO for MISSING events).
>
> Meanwhile IMHO there's a better way to address this - we're in
> hugetlb_no_page() but we're relying on a pte that was read _without_
> pgtable lock. It means it can be either unstable or changed. We do proper
> check for most of the rest code path for hugetlb_no_page() on pte_same()
> but we just forgot to do that for userfaultfd.
>
> One example is IMO we shouldn't target this "check pte under locking" for
> private mappings only. If the pte changed for either private/shared
> mappings, it's probably already illegal to be inside hugetlb_no_page().
> Logically for shared mappings the pte can change in any form too as long as
> under pgtable lock. So the lock should logically apply to shared mappings
> too, IMHO.
>
> In summary, I attached another patch that addressed it in the way I
> described above (only compile tested; even without writting the commit
> message because I need to go very soon). Let me know what do you think the
> best way to approach this.

Obviously I never remember to attach things when I meant to.. Sorry for the
noise. Attached this time.

>
> (During debugging this problem, I also found a few other issues; I'll
> not make this email any longer but will verify them next week and follow up
> from there)
>
> Thanks,
>
> >
> > From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > Date: Fri, 30 Sep 2022 13:45:08 -0700
> > Subject: [PATCH] hugetlb: check pte with page table lock before handing to
> > userfault
> >
> > In hugetlb_no_page we decide a page is missing if not present in the
> > page cache. This is perfectly valid for shared/file mappings where
> > pages must exist in the page cache. For anon/private mappings, the page
> > table must be checked. This is done early in hugetlb_fault processing
> > and is the reason we enter hugetlb_no_page. However, the early check is
> > made without holding the page table lock. There could be racing updates
> > to the pte entry, so check again with page table lock held before
> > deciding to call handle_userfault.
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > ---
> > mm/hugetlb.c | 21 ++++++++++++++++++++-
> > 1 file changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 60e077ce6ca7..4cb44a4629b8 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > if (idx >= size)
> > goto out;
> > /* Check for page in userfault range */
> > - if (userfaultfd_missing(vma))
> > + if (userfaultfd_missing(vma)) {
> > + /*
> > + * For missing pages, the page cache (checked above) is
> > + * the 'source of truth' for shared mappings. For anon
> > + * mappings, the source of truth is the page table. We
> > + * already checked huge_pte_none_mostly() in
> > + * hugetlb_fault. However, there could be racing
> > + * updates. Check again while holding page table lock
> > + * before handing off to userfault.
> > + */
> > + if (!(vma->vm_flags & VM_MAYSHARE)) {
> > + ptl = huge_pte_lock(h, mm, ptep);
> > + if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
> > + spin_unlock(ptl);
> > + ret = 0;
> > + goto out;
> > + }
> > + spin_unlock(ptl);
> > + }
> > return hugetlb_handle_userfault(vma, mapping, idx,
> > flags, haddr, address,
> > VM_UFFD_MISSING);
> > + }
> >
> > page = alloc_huge_page(vma, haddr, 0);
> > if (IS_ERR(page)) {
> > --
> > 2.37.3
> >
>
> --
> Peter Xu

--
Peter Xu