Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit

From: Andy Lutomirski
Date: Wed Sep 11 2019 - 00:19:38 EST

On Tue, Sep 10, 2019 at 12:26 PM Thomas HellstrÃm (VMware)
<thomas_os@xxxxxxxxxxxx> wrote:
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
> >
> >> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >>
> >>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas HellstrÃm (VMware) wrote:
> >>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> Thanks for the second batch of patches! These look much improved on all
> >>>> fronts.
> >>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
> >>> bother you with those though, since this assumes TTM will be using the dma
> >>> API.
> >> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> >> branch:
> >>
> >>
> >>
> >> they should allow to fault dma api pages in the page fault handler. But
> >> this is totally hot off the press and not actually tested for the last
> >> few patches. Note that I've also included your two patches from this
> >> series to handle SEV.
> > I read that patch, and it seems like youâve built in the assumption that all pages in the mapping use identical protection or, if not, that the same fake vma hack that TTM already has is used to fudge around it. Could it be reworked slightly to avoid this?
> >
> > I wonder if itâs a mistake to put the encryption bits in vm_page_prot at all.
> From my POW, the encryption bits behave quite similar in behaviour to
> the caching mode bits, and they're also in vm_page_prot. They're the
> reason TTM needs to modify the page protection in the fault handler in
> the first place.
> The problem seen in TTM is that we want to be able to change the
> vm_page_prot from the fault handler, but it's problematic since we have
> the mmap_sem typically only in read mode. Hence the fake vma hack. From
> what I can tell it's reasonably well-behaved, since pte_modify() skips
> the bits TTM updates, so mprotect() and mremap() works OK. I think
> split_huge_pmd may run into trouble, but we don't support it (yet) with
> TTM.

One thing I'm confused about: does TTM move individual pages between
main memory and device memory or does it move whole mappings? If it
moves individual pages, then a single mapping could have PTEs from
dma_alloc_coherent() space and from PCI space. How can this work with
vm_page_prot? I guess you might get lucky and have both have the same
protection bits, but that seems like an unfortunate thing to rely on.

As a for-real example, take a look at arch/x86/entry/vdso/vma.c. The
"vvar" VMA contains multiple pages that are backed by different types
of memory. There's a page of ordinary kernel memory. Historically
there was a page of genuine MMIO memory, but that's gone now. If the
system is a SEV guest with pvclock enabled, then there's a page of
decrypted memory. They all share a VMA, they're instantiated in
.fault, and there is no hackery involved. Look at vvar_fault().

So, Christoph, can't you restructure your code a bit to compute the
caching and encryption state per page instead of once in
dma_mmap_prepare() and insert the pages accordingly? You might need
to revert 6d958546ff611c9ae09b181e628c1c5d5da5ebda depending on
exactly how you structure it.

> We could probably get away with a WRITE_ONCE() update of the
> vm_page_prot before taking the page table lock since
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an
> address space lock ("buffer object reservation")
> c) When we need to update there are no valid page table entries in the
> vma, since it only happens directly after mmap(), or after an
> unmap_mapping_range() with the same address space lock. When another
> reader (for example split_huge_pmd()) sees a valid page table entry, it
> also sees the new page protection and things are fine.
> But that would really be a special case. To solve this properly we'd
> probably need an additional lock to protect the vm_flags and
> vm_page_prot, taken after mmap_sem and i_mmap_lock.

This is all horrible IMO.