[PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

From: David Hildenbrand
Date: Tue Mar 08 2022 - 09:15:00 EST


This series is the result of the discussion on the previous approach [2].
More information on the general COW issues can be found there. It is based
on v5.17-rc7 and [1], which resides in -mm and -next:
[PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
THP and swap

v1 is located at:
https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1

This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
on an anonymous page and COW logic fails to detect exclusivity of the page,
then replaces the anonymous page with a copy in the page table: the
GUP pin loses synchronicity with the pages mapped into the page tables.
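
To make the failure mode concrete, here is a minimal userspace sketch of
it. This is a toy model and not kernel code: all toy_* names are
hypothetical, a "page" is a small buffer, and the "page table" is a single
pointer. It only shows how a pin taken before a wrong COW keeps pointing at
the old page while the process continues on the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model, not kernel code: a "page" is just a small buffer. */
struct toy_page {
	int pincount;		/* stands in for a FOLL_PIN reference */
	char data[8];
};

/* One "page table entry": the page the process currently maps. */
struct toy_pte {
	struct toy_page *page;
	int writable;
};

/* Take a pin: the pinner remembers the page that is mapped right now. */
static struct toy_page *toy_pin(struct toy_pte *pte)
{
	assert(pte->page != NULL);
	pte->page->pincount++;
	return pte->page;
}

/*
 * Write fault on a R/O entry. If exclusivity was detected, the mapped page
 * is reused; otherwise the entry is switched to a fresh copy (COW).
 * Returns 0 on success, -1 if the copy could not be allocated.
 */
static int toy_write_fault(struct toy_pte *pte, int detected_exclusive)
{
	if (!detected_exclusive) {
		struct toy_page *copy = malloc(sizeof(*copy));

		if (!copy)
			return -1;
		memcpy(copy->data, pte->page->data, sizeof(copy->data));
		copy->pincount = 0;
		pte->page = copy;	/* the pinner is not told about this */
	}
	pte->writable = 1;
	return 0;
}

/* The pin is "in sync" only while it refers to the mapped page. */
static int toy_pin_in_sync(const struct toy_pte *pte,
			   const struct toy_page *pinned)
{
	return pte->page == pinned;
}
```

Pinning a page and then calling toy_write_fault() with detected_exclusive
set to 0 leaves toy_pin_in_sync() returning 0: later writes through the
entry land in the copy, and the pinner never observes them.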

This issue, including other related COW issues, has been summarized in [3]
under 3):
"
3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)

page_maybe_dma_pinned() is used to check if a page may be pinned for
DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
tolerable, false negatives are problematic: pages that are pinned for
DMA must not be added to the swapcache. If it happens, the (now pinned)
page could be faulted back from the swapcache into page tables
read-only. Future write-access would detect the pinning and COW the
page, losing synchronicity. For the interested reader, this is nicely
documented in feb889fb40fa ("mm: don't put pinned pages into the swap
cache").

Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
cases and can result in a violation of the documented semantics:
giving false negatives because of the race.

There are cases where we call it without properly taking a per-process
sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
handle, there is especially one rmap case (shrink_page_list) that's hard
to fix: in the rmap world, we're not limited to a single process.

The shrink_page_list() issue is really subtle. If we race with
someone pinning a page, we can trigger the same issue as in the FOLL_GET
case. See the detail section at the end of this mail on a discussion how
bad this can bite us with VFIO or other FOLL_PIN user.

It's harder to reproduce, but I managed to modify the O_DIRECT
reproducer to use io_uring fixed buffers [15] instead, which ends up
using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
similarly trigger a loss of synchronicity and consequently a memory
corruption.

Again, the root issue is that a write-fault on a page that has
additional references results in a COW and thereby a loss of
synchronicity and consequently a memory corruption if two parties
believe they are referencing the same page.
"

This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
in particular also taking care of concurrent pinning via GUP-fast. For
example, it also fully fixes a recently reported issue regarding NUMA
balancing [4]. While doing that, it further reduces "unnecessary
COWs", especially when we don't fork()/KSM and don't swapout, and fixes the
COW security for hugetlb for FOLL_PIN.

In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
anonymous page is exclusive. Exclusive anonymous pages that are mapped
R/O can directly be mapped R/W by the COW logic in the write fault handler.
Exclusive anonymous pages that are to be shared (fork(), KSM) first have
to be marked shared -- which will fail if there are
GUP pins on the page. GUP is only allowed to take a pin on anonymous pages
that are exclusive. The PT lock is the primary mechanism to synchronize
modifications of PG_anon_exclusive. GUP-fast is synchronized either via the
src_mm->write_protect_seq or via clear/invalidate+flush of the relevant
page table entry.
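
The rules in the summary above can be sketched as a small userspace state
model. This is not the code from the series: the model_* names are
hypothetical, a single bool stands in for PG_anon_exclusive, and all
synchronization (PT lock, GUP-fast, write_protect_seq) is left out. The
model also ignores any further reuse checks the real write fault handler
may perform on non-exclusive pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a mapped anonymous page. */
struct model_page {
	bool anon_exclusive;	/* models PG_anon_exclusive */
	int pins;		/* models outstanding GUP pins */
};

enum model_fault_result {
	MODEL_REUSE,		/* map the existing page R/W */
	MODEL_COPY,		/* COW: map a copy instead */
};

/* fork()/KSM path: marking a page shared fails while it is pinned. */
static bool model_try_share(struct model_page *page)
{
	if (page->pins > 0)
		return false;
	page->anon_exclusive = false;
	return true;
}

/* GUP path: a pin may only be taken on an exclusive page. */
static bool model_try_pin(struct model_page *page)
{
	if (!page->anon_exclusive)
		return false;
	page->pins++;
	return true;
}

static void model_unpin(struct model_page *page)
{
	assert(page->pins > 0);
	page->pins--;
}

/* Write fault on a R/O mapping: exclusive pages are reused directly. */
static enum model_fault_result model_write_fault(const struct model_page *page)
{
	return page->anon_exclusive ? MODEL_REUSE : MODEL_COPY;
}
```

The property the model is meant to show: as long as a pin exists,
model_try_share() fails, the page stays exclusive, and model_write_fault()
keeps answering MODEL_REUSE, so the pinned page is never replaced by a
copy.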

Special care has to be taken with swap, migration, and THPs (whereby a
PMD-mapping can be converted to a PTE mapping and we have to track
information for subpages). Besides these, we let the rmap code handle most
of the magic. For reliable R/O pins of anonymous pages, we need
FAULT_FLAG_UNSHARE logic as in our previous approach [2]; however, it's now
100% mapcount free and I further simplified it a bit.
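
As a rough illustration of the R/O case, the decision "does this pin need
an unshare fault first?" might be modeled as below. This is a userspace
sketch under assumptions, not the logic from the patches: the SKETCH_*
flags and sketch_* helpers are hypothetical stand-ins, and the model
assumes that an unshare fault always leaves the process with an exclusive
page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for FOLL_PIN and FOLL_WRITE. */
#define SKETCH_PIN	0x1u
#define SKETCH_WRITE	0x2u

struct sketch_page {
	bool anon;		/* anonymous page? */
	bool anon_exclusive;	/* models PG_anon_exclusive */
};

/*
 * Would a R/O pin of this page have to trigger unsharing first?
 * Only pins of possibly shared anonymous pages qualify.
 */
static bool sketch_ro_pin_must_unshare(unsigned int flags,
				       const struct sketch_page *page)
{
	if (!(flags & SKETCH_PIN))
		return false;	/* plain references are not handled here */
	if (flags & SKETCH_WRITE)
		return false;	/* assumed to take the ordinary write fault */
	if (!page->anon)
		return false;	/* only anonymous pages are tracked */
	return !page->anon_exclusive;
}

/*
 * Stands in for the unshare fault: afterwards the process is assumed to
 * map an exclusive page, obtained by copying or by reusing the old one.
 */
static void sketch_unshare(struct sketch_page *page)
{
	page->anon_exclusive = true;
}
```

A caller in this model would loop: check sketch_ro_pin_must_unshare(),
call sketch_unshare() if it returns true, then retry the pin, which now
sees an exclusive page.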

#1 is a fix
#3-#9 are mostly rmap preparations for PG_anon_exclusive handling
#10 introduces PG_anon_exclusive
#11 uses PG_anon_exclusive and makes R/W pins of anonymous pages
reliable
#12 is a preparation for reliable R/O pins
#13 and #14 are reused/modified GUP-triggered unsharing for R/O GUP pins;
they make R/O pins of anonymous pages reliable
#15 adds sanity checks when (un)pinning anonymous pages

I'm not proud of patch #10, suggestions welcome. Patch #11 contains
excessive explanations and the main logic for R/W pins. #13 and #14
resemble what we proposed in the previous approach [2]. I consider the
general approach of #15 very nice and helpful, and I remember Linus even
envisioning something like that for finding BUGs, although we might
eventually want to implement the sanity checks differently.

It passes my growing set of tests for "wrong COW" and "missed COW",
including the ones in [3] -- I'd really appreciate some experienced eyes
to take a close look at corner cases.

I'm planning on sending a part 3 that will remember PG_anon_exclusive for
ordinary swap entries: this will make FOLL_GET | FOLL_WRITE references
reliable and fix the memory corruptions for O_DIRECT -- as described in
[3] under 2) -- as well, as long as there is no fork().

The long-term goal should be to convert relevant users of FOLL_GET to
FOLL_PIN; however, with part 3 it wouldn't be required to fix the obvious
memory corruptions we are aware of. Once that's in place we can streamline
our COW logic for hugetlb to rely on page_count() as well and fix any
possible COW security issues.

[1] https://lkml.kernel.org/r/20220131162940.210846-1-david@xxxxxxxxxx
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@xxxxxxxxxx
[4] https://bugzilla.kernel.org/show_bug.cgi?id=215616


RFC -> v1:
* Rephrased/extended some patch descriptions+comments
* Tested on aarch64, ppc64 and x86_64
* "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
-> Added
* "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()"
-> Added
* "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
-> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily
dropping the PT lock.
-> Use "reuse" label in __do_huge_pmd_anonymous_page()
-> Slightly simplify logic in hugetlb_cow()
-> In remove_migration_pte(), remove unrelated changes around
page_remove_rmap()
* "mm: support GUP-triggered unsharing of anonymous pages"
-> In handle_pte_fault(), trigger pte_mkdirty() only with
FAULT_FLAG_WRITE
-> In __handle_mm_fault(), extend comment regarding anonymous PUDs
* "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
anonymous page"
-> Added unsharing logic to gup_hugepte() and gup_huge_pud()
-> Changed return logic in __follow_hugetlb_must_fault(), making sure
that "unshare" is always set
* "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
exclusive when (un)pinning"
-> Slightly simplified sanity_check_pinned_pages()

David Hildenbrand (15):
mm/rmap: fix missing swap_free() in try_to_unmap() after
arch_unmap_one() failed
mm/hugetlb: take src_mm->write_protect_seq in
copy_hugetlb_page_range()
mm/memory: slightly simplify copy_present_pte()
mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and
page_try_dup_anon_rmap()
mm/rmap: convert RMAP flags to a proper distinct rmap_t type
mm/rmap: remove do_page_add_anon_rmap()
mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon()
page exclusively
mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages
mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
mm/gup: disallow follow_page(FOLL_PIN)
mm: support GUP-triggered unsharing of anonymous pages
mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
anonymous page
mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
exclusive when (un)pinning

fs/proc/page.c | 3 +-
include/linux/mm.h | 46 ++++++-
include/linux/mm_types.h | 8 ++
include/linux/page-flags.h | 124 +++++++++++++++++-
include/linux/rmap.h | 109 ++++++++++++++--
include/linux/swap.h | 15 ++-
include/linux/swapops.h | 25 ++++
include/trace/events/mmflags.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/gup.c | 103 ++++++++++++++-
mm/huge_memory.c | 122 +++++++++++++-----
mm/hugetlb.c | 137 ++++++++++++++------
mm/khugepaged.c | 2 +-
mm/ksm.c | 15 ++-
mm/memory-failure.c | 24 +++-
mm/memory.c | 221 ++++++++++++++++++++-------------
mm/memremap.c | 11 ++
mm/migrate.c | 40 +++++-
mm/mprotect.c | 8 +-
mm/page_alloc.c | 13 ++
mm/rmap.c | 95 ++++++++++----
mm/swapfile.c | 4 +-
mm/userfaultfd.c | 2 +-
23 files changed, 904 insertions(+), 227 deletions(-)

--
2.35.1