Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

From: Hugh Dickins
Date: Mon Aug 21 2023 - 22:52:01 EST


On Mon, 21 Aug 2023, Jann Horn wrote:
> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > thought it had emptied: page lock on the huge page is enough to protect
> > against WP faults (which find the PTE has been cleared), but not enough
> > to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
> >
> > retract_page_tables() protects against this by checking !vma->anon_vma;
> > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > mappings, even those with an anon_vma prepared for another part of the
> > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > mappings which are userfaultfd_armed(). Whether it needs to work on
> > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > but assume that it does.
>
> I think we couldn't rely on anon_vma here anyway, since holding the
> mmap_lock in read mode doesn't prevent concurrent creation of an
> anon_vma?

We would have had to do the same as in retract_page_tables() (which
doesn't even have mmap_lock for read): recheck !vma->anon_vma after
finally acquiring ptlock. But the !anon_vma limitation is certainly
not acceptable here anyway.

>
> > Just for this case, take the pmd_lock() two steps earlier: not because
> > it gives any protection against this case itself, but because ptlock
> > nests inside it, and it's the dropping of ptlock which let the bug in.
> > In other cases, continue to minimize the pmd_lock() hold time.
>
> Special-casing userfaultfd like this makes me a bit uncomfortable; but
> I also can't find anything other than userfaultfd that would insert
> pages into regions that are khugepaged-compatible, so I guess this
> works?

I'm as sure as I can be that it's solely because userfaultfd breaks
the usual rules here (and in fairness, IIRC Andrea did ask my permission
before making it behave that way on shmem, COWing without a source page).

Perhaps something else will want that same behaviour in future (it's
tempting, but difficult to guarantee correctness); for now, it is just
userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
expecting uffd to add more such exceptional modes in future).

>
> I guess an alternative would be to use a spin_trylock() instead of the
> current pmd_lock(), and if that fails, temporarily drop the page table
> lock and then restart from step 2 with both locks held - and at that
> point the page table scan should be fast since we expect it to usually
> be empty.

That's certainly a good idea, if collapse on userfaultfd_armed private
is anything of a common case (I doubt, but I don't know). It may be a
better idea anyway (saving a drop and retake of ptlock).

I gave it a try, expecting to end up with something that would lead
me to say "I tried it, but it didn't work out well"; but actually it
looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
and no more complicated (as Peter rightly observes) than the original.

It's up to you and Peter, and whoever has strong feelings about it,
to choose between them: I don't mind (but I shall be sad if someone
demands that I indent that comment deeper - I'm not a fan of long
multi-line comments near column 80).


[PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd. "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed(). Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Now trylock pmd lock without dropping ptlock (suggested by jannh): if
that fails, drop and retake ptlock around taking pmd lock, and just in
the uffd private case, go back to recheck and empty the page table.

Reported-by: Jann Horn <jannh@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@xxxxxxxxxxxxxx/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
---
mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..ad1c571772fe 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
struct page *hpage;
pte_t *start_pte, *pte;
pmd_t *pmd, pgt_pmd;
- spinlock_t *pml, *ptl;
+ spinlock_t *pml = NULL, *ptl;
int nr_ptes = 0, result = SCAN_FAIL;
int i;

@@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
haddr, haddr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
notified = true;
- start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
- if (!start_pte) /* mmap_lock + page lock should prevent this */
- goto abort;
+ spin_lock(ptl);
+recheck:
+ start_pte = pte_offset_map(pmd, haddr);
+ VM_BUG_ON(!start_pte); /* mmap_lock + page lock should prevent this */

/* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
nr_ptes++;
}

- pte_unmap_unlock(start_pte, ptl);
+ pte_unmap(start_pte);

/* step 3: set proper refcount and mm_counters. */
if (nr_ptes) {
page_ref_sub(hpage, nr_ptes);
add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+ nr_ptes = 0;
}

- /* step 4: remove page table */
+ /* step 4: remove empty page table */
+ if (!pml) {
+ pml = pmd_lockptr(mm, pmd);
+ if (pml != ptl && !spin_trylock(pml)) {
+ spin_unlock(ptl);
+ spin_lock(pml);
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ /*
+ * pmd_lock covers a wider range than ptl, and (if split from mm's
+ * page_table_lock) ptl nests inside pml. The less time we hold pml,
+ * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+ * inserts a valid as-if-COWed PTE without even looking up page cache.
+ * So page lock of hpage does not protect from it, so we must not drop
+ * ptl before pgt_pmd is removed, so uffd private needs rechecking.
+ */
+ if (userfaultfd_armed(vma) &&
+ !(vma->vm_flags & VM_SHARED))
+ goto recheck;
+ }
+ }

- /* Huge page lock is still held, so page table must remain empty */
- pml = pmd_lock(mm, pmd);
- if (ptl != pml)
- spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
pmdp_get_lockless_sync();
if (ptl != pml)
@@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
}
if (start_pte)
pte_unmap_unlock(start_pte, ptl);
+ if (pml && pml != ptl)
+ spin_unlock(pml);
if (notified)
mmu_notifier_invalidate_range_end(&range);
drop_hpage:
--
2.35.3