Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
From: David Hildenbrand
Date: Fri Jan 15 2021 - 04:01:17 EST
On 10.01.21 01:44, Andrea Arcangeli wrote:
> Hello Andrew and everyone,
>
> Once we agree that COW page reuse requires full accuracy, the next
> step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to
> return going in that direction.
After stumbling over the heated discussion around this, I wanted to
understand the details and the different opinions. I tried to summarize
in my simple words (bear with me) what happened and how I think we can
proceed from here. Maybe that helps.
====
What happened:
1) We simplified handling of faults on write-protected pages (page table
entries): we changed the logic when we can reuse a page ("simply
unprotecting it"), and when we have to copy it instead (COW). The
essence of the simplification is, that we only reuse a page if we are
the only single user of the page, meaning page_count(page) == 1, and the
page is mapped into a single process (page_mapcount(page) == 1);
otherwise we copy it. Simple.
2) The old code was complicated and there are GUP (e.g., RDMA, VFIO)
cases that were broken in various ways in the old code already: most
prominently fork(). As one example, it would have been possible for
mprotect(READ) memory to still get modified by GUP users like RDMA.
Write protection (AFAIU via any mechanism) after GUP pinned a page was
not effective; the page was not copied.
3) Speculative pagecache reference can temporarily bump up the
page_count(page), resulting in false positives. We could see
page_count(page) > 1, although we're the single instance that actually
uses a page. In the simplified code, we might copy a page although not
necessary (I cannot tell how often that actually happens).
4) clear_refs(4) ("measure approximately how much memory a process is
using"), uffd-wp (let's call it "lightweight write-protection, handling
the actual fault in user space"), and mprotect(READ) all write-protect
page table entries to generate faults on next write access. With the
simplified code, we will COW whenever we find the page_count(page) > 1.
The simplification seemed to regress clear_refs and uffdio-wp code
(AFAIU in case of uffd-wp, it results in memory corruption). But looks
like we can mostly fix it by adding more extensive locking.
5) Mechanisms like GUP (AFAIU including Direct I/O) also takes
references on pages, increasing page_count(). With the simplification,
we might now end up copying a page, although there is "somewhat" only a
single user/"process" involved.
One example is RDMA: if we read memory using RDMA and mprotect(READ)
such memory, we might end up copying the underlying page on the next
write: suddenly, RDMA is disconnected and will no longer read what is
getting written. Not to mention, we consume more memory. AFAIU, other
examples include direct I/O (e.g., write() with O_DIRECT).
AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
passthrough of a PCI device) can essentially be corrupted by "echo 4 >
/proc/[pid]/clear_refs".
6) While some people think it is okay to break GUP further, as it is
already broken in various other ways, other people think this is
changing something that used to work (AFAIU a user-visible change) with
little benefit.
7) There is no easy way to detect if a page really was pinned: we might
have false positives. Further, there is no way to distinguish if it was
pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
we most probably would need more counters, which we cannot fit into
struct page. (AFAIU, for huge pages it's easier).
However, AFAIU, even being able to detect if (and how) a page was pinned
would not completely help to solve the puzzle.
8) We have a vmsplice security issue that has to be fixed by touching
the code in question. A forked child process can read memory content of
its parent, which was modified by the parent after fork. AFAIU, the fix
will further lock us in into the direction of the code we are heading.
9) The simplification is part of v5.10, which is a LTS release. AFAIU,
that one needs fixing, too.
I see the following possible directions we can head
A) Keep the simplification. Try fixing the fallout. Keep the GUP cases
broken or make mprotect() fail when detecting such a scenario;
AFAIU, both are user-visible changes.
B) Keep the simplification. Try fixing the fallout. Fix GUP cases that
used to work; AFAIU fixing this is the hard/impossible part, and is
undesired by some people..
C) Revert the simplification for now. Go back to the drawing board and
use what we learned to come up with a simplification that (all? )
people are happy with.
D) Revert the simplification: turns out the code could not get
simplified to this extend. We learned a lot, though.
======
Please let me know in case I messed up anything and/or missed important
points.
--
Thanks,
David / dhildenb