[PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap

From: David Hildenbrand
Date: Wed Jan 26 2022 - 04:57:10 EST


This series is the result of the discussion on the previous approach [1].
More information on the general COW issues can be found there.


This series attempts to optimize and streamline the COW logic for ordinary
anon pages and THP anon pages, fixing two remaining instances of
CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information can
leak from a parent process to a child process via anonymous pages shared
during fork().

This issue, including other related COW issues, has been summarized in [2]:
"
1. Observing Memory Modifications of Private Pages From A Child Process

Long story short: process-private memory might not be as private as you
think once you fork(): successive modifications of private memory
regions in the parent process can still be observed by the child
process, for example, by smart use of vmsplice()+munmap().

The core problem is that pinning pages readable in a child process, such
as done via the vmsplice system call, can result in a child process
observing memory modifications done in the parent process the child is
not supposed to observe. [1] contains an excellent summary and [2]
contains further details. This issue was assigned CVE-2020-29374 [9].

For this to trigger, it's required to use a fork() without subsequent
exec(), for example, as used under Android zygote. Without further
details about an application that forks less-privileged child processes,
one cannot really say what's actually affected and what's not -- see the
details section the end of this mail for a short sshd/openssh analysis.

While commit 17839856fd58 ("gup: document and work around "COW can break
either way" issue") fixed this issue and resulted in other problems
(e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
simplification") re-introduced part of the problem unfortunately.

The original reproducer can be modified quite easily to use THP [3] and
make the issue appear again on upstream kernels. I modified it to use
hugetlb [4] and it triggers as well. The problem is certainly less
severe with hugetlb than with THP; it merely highlights that we still
have plenty of open holes we should be closing/fixing.

Regarding vmsplice(), the only known workaround is to disallow the
vmsplice() system call ... or disable THP and hugetlb. But who knows
what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous pages during swapin, anonymous THP and hugetlb.
This series tackles anonymous pages during swapin and anonymous THP:
* do_swap_page() for handling COW on PTEs *during* swapin directly
* do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
faults

With this series, we'll apply the same COW logic we have in do_wp_page()
to all swappable anon pages: don't reuse (map writable) the page in
case there are additional references (page_count() != 1). All users of
reuse_swap_page() are remove, and consequently reuse_swap_page() is
removed.

In general, we're struggling with the following COW-related issues:
(1) "missed COW": we miss to copy on write and reuse the page (map it
writable) although we must copy because there are pending references
from another process to this page. The result is a security issue.
(2) "wrong COW": we copy on write although we wouldn't have to and
shouldn't: if there are valid GUP references, they will become out of
sync with the pages mapped into the page table. We fail to detect that
such a page can be reused safely, especially if never more than a
single process mapped the page. The result is an intra process
memory corruption.
(3) "unnecessary COW": we copy on write although we wouldn't have to:
performance degradation and temporary increases swap+memory consumption
can be the result.

While this series fixes (1) for swappable anon pages, it tries to reduce
reported cases of (3) first as good and easy as possible to limit the
impact when streamlining. The individual patches try to describe in which
cases we will run into (3).

This series certainly makes (2) worse for THP, because a THP will now get
PTE-mapped on write faults if there are additional references, even if
there was only ever a single process involved: once PTE-mapped, we'll copy
each and every subpage and won't reuse any subpage as long as the
underlying compound page wasn't split.

I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to
mark anon pages that are exclusive to a single process, allow GUP pins only
on such exclusive pages, and allow turning exclusive pages shared
(clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with
PageAnonExclusive set never have to be copied during write faults, but
eventually during fork() if they cannot be turned shared. The improved
reuse logic in this series will essentially also be the logic to reset
PageAnonExclusive. This work will certainly take a while, but I'm planning
on sharing details before having code fully ready.


#1-#5 can be applied independently of the rest. #6-#9 are mostly only
cleanups related to reuse_swap_page().


Notes:
* For now, I'll leave hugetlb code untouched: "unnecessary COW" might
easily break existing setups because hugetlb pages are a scarce resource
and we could just end up having to crash the application when we run out
of hugetlb pages. We have to be very careful and the security aspect with
hugetlb is most certainly less relevant than for unprivileged anon pages.
* Instead of lru_add_drain() we might actually just drain the lru_add list
or even just remove the single page of interest from the lru_add list.
This would require a new helper function, and could be added if the
conditional lru_add_drain() turn out to be a problem.
* I am pretty much confused about the swapcache handling in khugepaged
code. This certainly needs an additional pair of eyes.
* I extended the test case already included in [1] to also test for the
newly found do_swap_page() case. I'll send that out separately once/if
this part was reviewed.

[1] https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx
[2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@xxxxxxxxxx

David Hildenbrand (9):
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: slightly clarify KSM logic in do_swap_page()
mm: streamline COW logic in do_swap_page()
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/swapfile: remove reuse_swap_page()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/huge_memory: remove stale locking logic from __split_huge_pmd()

include/linux/mm.h | 5 --
include/linux/swap.h | 4 --
mm/huge_memory.c | 99 ++++++----------------------------
mm/khugepaged.c | 16 ++++--
mm/memory.c | 123 +++++++++++++++++++++++++++++++------------
mm/swapfile.c | 104 ------------------------------------
6 files changed, 119 insertions(+), 232 deletions(-)


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
--
2.34.1