[PATCH 0/2] page_count can't be used to decide when wp_page_copy
From: Andrea Arcangeli
Date: Thu Jan 07 2021 - 15:06:01 EST
Hello,
I prepared in 2/2 a fix to make UFFDIO_WRITEPROTECT and
clear_refs_write cope with page_count in do_wp_page. It'd stack
perfectly on top of 1/2 from will which fixes an orthogonal regression
and it'd need to be applied first since its Fixes tag comes first.
I hope this patchset shows and my initial my answer in
https://lkml.kernel.org/r/X+PoXCizo392PBX7@xxxxxxxxxx shows I tried to
keep an open mind and to try to fix what
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 broke. Even in the commit of
2/2 I wrote "is completely correct" despite I had to change my mind
about that.
It turns out the memory corruption caused by the breakage in the TLB
flushing is a walk in the park to fix for clear_refs and
UFFDIO_WRITEPROTECT, that is only the tip of the icerberg.
To simplify, let's forget the mmap_read_lock and let's assume we
hypothetically throw away the mmap_read_lock from the kernel and
UFFDIO_WRITEPROTECT and clear_refs and everything else takes the
mmap_write_lock only.
Even then, clear_refs and UFFDIO_WRITEPROTECT will remain broken if
the memory they're wrprotecting is GUP pinned by a secondary MMU or
RDMA or something, that is reading the memory through a read GUP
pin.
You can only make fork safe from the page_count by COWing any page
that has a GUP pin, because fork is actually allowed to COW (and that
will also fix the longstanding fork vs threads vs GUP race as result,
which I tried already once in
https://lkml.kernel.org/r/20090311165833.GI27823@random.random
). However it's fundamentally flawed and forbidden to COW after
clear_refs and UFFDIO_WRITEPROTECT, if fork or clone have never been
called and there's any GUP pin on the pages that were wrprotected.
In other words, the page_count check in do_wp_page, extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
The above scenario is even more explicit and unfixable when it's not
just a single page but something bigger mapped by multiple processes
that was GUP pinned.
Either COWs would have to be forbidden and features clear_refs dropped
from the kernel and mprotect also would be strictly forbidden to ever
leave any pte_write bit clear for any reason, or do_wp_page requires
full accuracy on exclusive pages in MAP_PRIVATE private COWs or
MAP_ANON do_wp_page.
In simple terms: the page_count check in do_wp_page makes it
impossible to wrprotect memory, if such memory is under a !FOLL_WRITE
GUP pin.
It's a feature (not a bug) that the GUP pin must not trigger a COW,
and this is also already explicitly documented in comments in the
current source in places that are still using mapcount, or we'd be
probably dealing with more breakage than what's reproducible right
now.
Can we take a step back and start looking at what started all this VM
breakage? I mean specifically Jann's discovery that parent can attack
the child after the child does drop privs by using vmsplice long term
unprivileged GUP pins?
vmsplice syscall API is insecure allowing long term GUP PINs without
privilege.
Before touching any of the COW code, something had to be done on
vmsplice because even after 09854ba94c6aad7886996bfbee2530b3d8a7f4f4
there's still no way to way to tame the other VM breakage and working
DoS caused by vmsplice. (I already sent the vmsplice DoS PoC exploit
to who of may concern on Aug 27 2020)
zygote before worrying about COWs, needs to block vmsplice in the
child (and io_uring until it's fixed) with seccomp by default until
the syscall become privileged.
I also recommended such change to podman default allowlist, but
apparently it wasn't committed checking below, vmsplice still there in
the allowlist unfortunately. I'll try to suggest it again in a follow
up.
https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json
io_uring unlike vmsplice can remain unprivileged, but it needs to use
the mmu notifier to make those GUP pins become VM neutral.
After io_uring is fixed with mmu notifier, and vmsplice becomes a
privileged syscall, the concern that remains for the zygote model is
on par with the fact that there is RAM carelessly shared with L1 and
L2 cache also shared between parent and child. It'll take DMA and
burning the flash in order to keep poking for the parent to write at
the wrong time. So the phone would get so hot or the battery would run
out of juice before the attack can expose data from the parent. So for
an attacker it may be easier to look for a side channel with flush and
reload on the shared L2 that doesn't rely on more costly GUP transient
pins.
So the concern that started all this, once vmsplice and io_uring are
both fixes, in my view becomes theoretical.
Especially on enterprise (non-embedded), this issue is not even
theoretical but it's fully irrelevant, since execve has to be used
after drop privs (or the app needs to use MADV_DONTFORK or unshare all
memory if using a jailer that doesn't execve) to avoid the
aforementioned side channel to remain. Only RAM constrained embedded
devices are justified to take shortcuts with the ensuing side channel
security concern that emerges from it.
For all the above reasons, because so far the cure has been worse than
the disease itself, I'd recommend enterprise distro kernels to ignore
the zygote embedded model attack for the time being and not to
backport anything in this regard. What should not be backported,
specifically starts in 17839856fd588f4ab6b789f482ed3ffd7c403e1f.
17839856fd588f4ab6b789f482ed3ffd7c403e1f was supposed to be fine and
not break anything and unfortunately I was too busy while
17839856fd588f4ab6b789f482ed3ffd7c403e1f morphed into
09854ba94c6aad7886996bfbee2530b3d8a7f4f4, so I still miss a whole huge
discussion in that transition. I don't know what was fundamentally
flawed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f yet.
All I comment about here is purely the current end result:
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 not any of the intermediate
steps that brought us there.
I already mentioned the page_count wasn't workable in do_wp_page in
Message-ID: <20200527212005.GC31990@xxxxxxxxxx> on May 27 2020, quote:
"I don't see how we can check the page_count to decide if to COW or not
in the wrprotect fault, given [..] introduced in >="
Then Hugh said he wasn't happy about on 1 Sep 2020 in Message-ID:
<alpine.LSU.2.11.2008312207450.1212@eggly.anvils>.
In https://lkml.kernel.org/r/X+O49HrcK1fBDk0Q@xxxxxxxxxx I suggested
"I hope we can find a way put the page_mapcount back where there" and
now I have to double down and suggest that the page_count is
fundamentally unsafe in do_wp_page.
I see how the page_count would also fix the specific zygote
child->parent attack and I kept an open mind hoping it would just
solve all problems magically. So I tried to fix even clear_refs to
cope with it, but this is only the tip of the icerbeg of what really
breaks.
So in short I contextually self-NAK 2/2 of this patchset and we need
to somehow reverse 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 instead.
Thanks,
Andrea
Andrea Arcangeli (1):
mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Will Deacon (1):
mm: proc: Invalidate TLB after clearing soft-dirty page state
fs/proc/task_mmu.c | 26 ++++++++++++---
include/linux/mm.h | 46 +++++++++++++++++++++++++++
include/linux/mm_types.h | 5 +++
kernel/fork.c | 1 +
mm/memory.c | 69 ++++++++++++++++++++++++++++++++++++++++
mm/mprotect.c | 4 +++
6 files changed, 146 insertions(+), 5 deletions(-)