Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling

From: Peter Xu
Date: Mon Oct 03 2022 - 17:33:09 EST

On Mon, Oct 03, 2022 at 10:20:29AM -0700, Mike Kravetz wrote:
> On 10/03/22 11:56, Peter Xu wrote:
> > After the recent rework patchset of hugetlb locking on pmd sharing,
> > kselftest for userfaultfd sometimes fails on hugetlb private tests with
> > unexpected write fault checks.
> >
> > It turns out there's nothing wrong within the locking series regarding this
> > matter, but it could have changed the timing of threads so it can trigger
> > an old bug.
> >
> > The real bug is when we call hugetlb_no_page() we're not with the pgtable
> > lock. It means we're reading the pte values lockless. It's perfectly fine
> > in most cases because before we do normal page allocations we'll take the
> > lock and check pte_same() again. However before that, there are actually
> > two paths on userfaultfd missing/minor handling that may directly move on
> > with the fault process without checking the pte values.
> >
> > It means for these two paths we may be generating an uffd message based on
> > an unstable pte, while an unstable pte can legally be anything as long as
> > the modifier holds the pgtable lock.
> >
> > One example, which is also what happened in the failing kselftest and
> > caused the test failure, is that for private mappings CoW can happen on one
> > page. CoW requires pte being cleared before being replaced with a new page
> > for TLB coherency, but then there can be a race condition:
> >
> > thread 1 thread 2
> > -------- --------
> >
> > hugetlb_fault hugetlb_fault
> > private pte RO
> > hugetlb_wp
> > pgtable_lock()
> > huge_ptep_clear_flush
> > pte=NULL
> > hugetlb_no_page
> > generate uffd missing event
> > even if page existed!!
> > set_huge_pte_at
> > pgtable_unlock()
> Thanks for working on this Peter!
> I agree with this patch, but I suspect the above race is not possible. Why?
> In both cases, we take the hugetlb fault mutex when processing a huegtlb
> page fault. This means only one thread can execute the fault code for
> a specific mapping/index at a time. This is why I was so confused, and may
> remain a bit confused about how we end up with userfault processing a write
> fault. It was part of the reason for my 'unclear' wording about this being
> more about cpus not seeing updated values. Note that we do drop the fault
> mutex before calling handle_usefault, but by then we have already made the
> 'missing' determination.
> Thoughts? Perhaps, I am still confused.

It's my fault to have the commit message wrong, sorry. :) And thanks for
raising this question, I could have overlooked that.

It turns out it's not the CoW that's clearing the pte... it's the
wr-protect with huge_ptep_modify_prot_start(). So the race is with

Obviously when I was tracking the hpte changes I overlooked
huge_ptep_get_and_clear(), only seeing the CoW path and I'm pretty sure
that's already a bug which was obvious enough. I didn't prove they
happened at the same time during the MISSING event.

Then after I further looked at the CoW code I start to question myself on
why CoW would trigger at all even with an available fault mutex, since for
private mappings mapcount should be 1:

if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
if (!PageAnonExclusive(old_page))
page_move_anon_rmap(old_page, vma);
if (likely(!unshare))
set_huge_ptep_writable(vma, haddr, ptep);

return 0;

There could still be something else I didn't catch, even though that may
not be relevant to this patchset anymore.

I'll wait a few more hours for other reviewers, then prepare a new version
with the corrected commit message.


Peter Xu