Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86

From: Peter Xu
Date: Thu Jan 24 2019 - 22:30:56 EST


On Thu, Jan 24, 2019 at 10:40:50AM -0500, Jerome Glisse wrote:
> On Thu, Jan 24, 2019 at 01:16:16PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 10:09:38AM -0500, Jerome Glisse wrote:
> > > On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> > > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > > >
> > > > Accurate userfaultfd WP tracking is possible by tracking exactly which
> > > > virtual memory ranges were writeprotected by userland. We can't rely
> > > > only on the RW bit of the mapped pagetable because that information is
> > > > destroyed by fork() or KSM or swap. If we were to rely on that, we'd
> > > > need to stay on the safe side and generate false positive wp faults
> > > > for every swapped out page.
> >
> > (I'm trying to leave comments with my own understanding here; they
> > might not be the original purposes when Andrea proposed the idea.
> > Andrea, please feel free to chime in anytime especially if I am
> > wrong... :-)
> >
> > >
> > > So you want to forward write fault (of a protected range) to user space
> > > only if page is not write protected because of fork(), KSM or swap.
> > >
> > > This write protection feature is only for anonymous pages, right?
> > > Otherwise how would you protect a shared page (ie anyone can look it
> > > up and call page_mkwrite on it and start writing to it) ?
> >
> > AFAIU we want to support shared memory too in the future. One example
> > I can think of is current QEMU usage with DPDK: we have two processes
> > sharing the guest memory range. So indeed this might not work if
> > there are unknown/malicious users of the shared memory, however in
> > many use cases the users are all known and AFAIU we should just write
> > protect all these users then we'll still get notified when any of them
> > write to a page.
> >
> > >
> > > So for anonymous page for fork() the mapcount will tell you if page is
> > > write protected for COW. For KSM it is easy: check the page flag.
> >
> > Yes I agree that KSM should be easy. But for COW, please consider
> > when we write protect a page that was shared and RW removed due to
> > COW. Then when we page fault on this page should we report to the
> > monitor? IMHO we can't know that without a specific bit in the PTE.
> >
> > >
> > > For swap you can use the page lock to synchronize. A page that is
> > > write protected because of swap is write protected because it is being
> > > written to disk thus either under page lock, or with PageWriteback()
> > > returning true while write is ongoing.
> >
> > For swap I think the major problem is when the page was swapped out of
> > main memory and then we write to the page (which is by then a swap
> > entry). We'll first swap the page back into main memory, but then
> > IMHO we face an issue similar to the COW one above: we can't tell
> > whether this page was write protected by uffd-wp at all. Of course
> > we could check the VMA flags here and assume the page is write
> > protected whenever VM_UFFD_WP is set on the VMA, however that would
> > also cover pages which were never write protected, hence it would
> > generate false positive write protection messages. The same applies
> > to the COW case above. In conclusion, in these cases we cannot
> > identify write protection at page granularity without a specific
> > _PAGE_UFFD_WP bit in the PTE entries.
>
> So i need to think a bit more on this, probably not right now
> but just so i get the chain of events properly:
> 1 - user space ioctl UFD to write protect a range
> 2 - UFD sets a flag on the vma and updates CPU page table

A trivial supplement to these two steps, to be clear: changing the VMA
flags and changing the PTE permissions are separate steps. Say, to
write protect a newly mmap()ed region, we need to do:

(a) ioctl UFFDIO_REGISTER upon the range: this will properly attach
the VM_UFFD_WP flag upon the VMA object, and...

(b) ioctl UFFDIO_WRITEPROTECT upon the range again: this will
properly apply the new uffd-wp bit and write protect the
PTEs/PMDs.

Note that the range specified in step (b) can be just a part of the
registered range, so it does not need to cover the whole VMA, and it
is at page granularity.
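
To make (a) and (b) concrete, here is a minimal userspace sketch. It
assumes the ioctl names and struct layouts found in current
<linux/userfaultfd.h> (UFFDIO_REGISTER_MODE_WP, struct
uffdio_writeprotect, UFFDIO_WRITEPROTECT_MODE_WP); the interface in
this RFC may differ in details. uffd_wp_range() is a hypothetical
helper name, and obtaining the uffd descriptor (userfaultfd() syscall
plus UFFDIO_API handshake) is left out.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: register [addr, addr + len) for uffd-wp
 * tracking, then write protect the same range.
 * Returns 0 on success, -1 on failure (errno is set by ioctl()).
 */
int uffd_wp_range(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };

    /* Step (a): attach VM_UFFD_WP to the VMA(s) covering the range. */
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -1;

    /* Step (b): set the uffd-wp bit and write protect the PTEs/PMDs. */
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
        return -1;

    return 0;
}
```

The helper uses one range for both calls only to stay short; step (b)
could just as well be issued later, and for any page-aligned sub-range
of what was registered in step (a).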

> 3 - page can be individually write faulted and it sends a
> signal to UFD listener and they handle the fault
> 4 - UFD kernel updates the page table once userspace has
> handled the fault and sent result to UFD. At this point
> the vma still has the UFD write protect flag set.

Yes. As explained above, the VMA can have the VM_UFFD_WP flag even if
none of the PTEs underneath was write protected.

>
> So at any point in time in a range you might have writeable
> ptes that correspond to already handled UFD write faults. Now
> if COW, KSM or swap happens on those then on the next write
> fault you do not want to send a signal to userspace but handle
> the fault just as usual ?

Yes. Once the uffd write protection on a PTE has been resolved, it is
just like a normal PTE again, because when resolving the uffd-wp page
fault we'll also remove the special uffd-wp bit on the PTE/PMD.
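
As a rough illustration only, the decision can be modelled in plain C
as below. This is a toy model, not kernel code: the MODEL_PTE_* values
and both function names are made up for the example, and it only
covers a page that is not shared for COW.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real PTE bits; values are arbitrary. */
#define MODEL_PTE_RW      (1u << 0)
#define MODEL_PTE_UFFD_WP (1u << 1)

/* Should a write fault on this PTE be reported to the uffd monitor? */
bool model_fault_notifies(bool vma_has_uffd_wp, unsigned int pte)
{
    /* The RW bit plays no role: only the dedicated uffd-wp bit counts. */
    return vma_has_uffd_wp && (pte & MODEL_PTE_UFFD_WP);
}

/* Resolve a uffd-wp fault: drop the tracking bit, allow writes again. */
unsigned int model_resolve(unsigned int pte)
{
    assert(pte & MODEL_PTE_UFFD_WP);
    return (pte & ~MODEL_PTE_UFFD_WP) | MODEL_PTE_RW;
}
```

If something like swap later clears MODEL_PTE_RW on a resolved PTE,
model_fault_notifies() still returns false for it even though the VMA
keeps its flag, so the next write fault is handled as usual without a
message to userspace.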

And IMHO what's actually more special here is when we write protect a
private page that is still shared for COW (I'll skip KSM since it looks
very much like this case IIUC): due to COW the PTE has already lost the
RW bit, so when we do uffd-wp upon this page we simply set the uffd-wp
bit, which marks that this PTE was specifically write protected by
userfaultfd. And when we want to resolve the uffd-wp for such a PTE
we'll first check page_mapcount() and do COW if the page is still
shared by others.
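
A simplified model of that COW case, under the assumption that
resolving clears the uffd-wp bit and breaks COW in one go. All cow_*
names and COW_PTE_* values are hypothetical, and the page "copy" is
reduced to mapcount bookkeeping; the real code path is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real PTE bits; values are arbitrary. */
#define COW_PTE_RW      (1u << 0)
#define COW_PTE_UFFD_WP (1u << 1)

struct cow_model_page {
    int mapcount;                 /* number of PTEs mapping this page */
};

struct cow_model_pte {
    unsigned int flags;
    struct cow_model_page *page;
};

/* uffd-wp a PTE: set the tracking bit; RW may already be clear (COW). */
void cow_model_wp(struct cow_model_pte *pte)
{
    pte->flags |= COW_PTE_UFFD_WP;
    pte->flags &= ~COW_PTE_RW;
}

/*
 * Resolve uffd-wp on a PTE. If the page is still shared, switch the
 * PTE to the caller-provided private copy first. Returns true if a
 * COW break was needed.
 */
bool cow_model_resolve(struct cow_model_pte *pte,
                       struct cow_model_page *copy)
{
    bool did_cow;

    assert(pte->flags & COW_PTE_UFFD_WP);
    pte->flags &= ~COW_PTE_UFFD_WP;

    did_cow = pte->page->mapcount > 1;
    if (did_cow) {
        pte->page->mapcount--;    /* leave the shared page */
        copy->mapcount = 1;       /* private copy, mapped once */
        pte->page = copy;
    }
    pte->flags |= COW_PTE_RW;     /* page is exclusive now */
    return did_cow;
}
```

The point of the model is that cow_model_wp() cannot rely on RW to
record anything for a COW page, since RW is already gone; only the
extra bit distinguishes "write protected by uffd" from "write
protected by COW".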

>
> I believe this is the event flow, so i will ponder on this some
> more :)

Yes please. :) The workflow of the new ioctl()s was also mentioned in
the cover letter. Please feel free to have a look too.

Thanks,

--
Peter Xu