RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
From: Michael Kelley (LINUX)
Date: Mon Aug 28 2023 - 19:25:08 EST
From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Monday, August 28, 2023 3:14 PM
>
> On Mon, Aug 28, 2023 at 09:00:03PM +0000, Michael Kelley (LINUX) wrote:
> > From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Monday, August 28, 2023 9:13 AM
> > >
> > > On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx> Sent: Tuesday, August 15, 2023 7:54 PM
> > > > >
> > > > > From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Sunday,
> > > > > August 6, 2023 3:20 PM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > > > > be notified of the change. Because there are two separate steps, there's
> > > > > > > a window where the settings are inconsistent. Normally the code that
> > > > > > > initiates the transition (via set_memory_decrypted() or
> > > > > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > > > > during a transition, so the window of inconsistency is not a problem.
> > > > > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > > > > pages at arbitrary times, which could access a transitioning page during
> > > > > > > the window. In such a case, CoCo VM specific exceptions are taken
> > > > > > > (depending on the CoCo architecture in use). Current code in those
> > > > > > > exception handlers recovers and does "fixup" on the result returned by
> > > > > > > load_unaligned_zeropad(). Unfortunately, this exception handling and
> > > > > > > fixup code is tricky and somewhat fragile. At the moment, it is
> > > > > > > broken for both TDX and SEV-SNP.
> > > > > >
> > > > >
> > > > > Thanks for looking at this. I'm finally catching up after being out on
> > > > > vacation for a week.
> > > > >
> > > > > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > > > >
> > > > > I presume you meant "now fixed for TDX", which I agree with. Tom
> > > > > Lendacky has indicated that there's still a problem with SEV-SNP. He
> > > > > could fix that problem, but this approach of marking the pages
> > > > > invalid obviates the need for Tom's fix.
> > > > >
> > > > > >
> > > > > > > There's also a problem with the current code in paravisor scenarios:
> > > > > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > > > > are no architectural specs for how to do that.
> > > > >
> > > > > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > > > > what doesn't have a solution. To elaborate, with TD Partitioning, #VE
> > > > > is sent to the containing VM, not the main Linux guest VM. For
> > > > > everything except an EPT violation, the containing VM can handle
> > > > > the exception on behalf of the main Linux guest by doing the
> > > > > appropriate emulation. But for an EPT violation, the containing
> > > > > VM can only terminate the guest. It doesn't have sufficient context
> > > > > to handle a "valid" #VE with EPT violation generated due to
> > > > > load_unaligned_zeropad(). My proposed scheme of marking the
> > > > > pages invalid avoids generating those #VEs and lets TD Partitioning
> > > > > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > > > >
> > > > > > >
> > > > > > > To avoid these complexities of the CoCo exception handlers, change
> > > > > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > > > > following:
> > > > > > >
> > > > > > > 1. Remove aliasing mappings
> > > > > > > 2. Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > > > > 3. Flush the TLB globally
> > > > > > > 4. Flush the data cache if needed
> > > > > > > 5. Set/clear the encryption attribute as appropriate
> > > > > > > 6. Notify the hypervisor of the page status change
> > > > > > > 7. Add back the PRESENT bit
> > > > > >
> > > > > > Okay, looks safe.
> > > > > >
> > > > > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > > > > are completely decoupled. CoCo VM page transitions can proceed
> > > > > > > without needing to handle architecture-specific exceptions and fix
> > > > > > > things up. This decoupling reduces the complexity due to separate
> > > > > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > > > > architectures. Paravisor scenarios work properly without needing
> > > > > > > to forward exceptions.
> > > > > > >
> > > > > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > > > > because of touching the PTEs three times instead of just once. But
> > > > > > > the run time of this function is already dominated by the hypercall
> > > > > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > > > > case, this function is only used for CoCo VM page transitions, and
> > > > > > > is already unsuitable for hot paths.
> > > > > > >
> > > > > > > The architecture specific callback function for notifying the
> > > > > > > hypervisor typically must translate guest kernel virtual addresses
> > > > > > > into guest physical addresses to pass to the hypervisor. Because
> > > > > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > > > > translation needs updating. virt_to_phys() or equivalent continues
> > > > > > > to work for direct map addresses. But vmalloc addresses cannot use
> > > > > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > > > > manually walk the page table hierarchy, so performance is the same.
> > > > > >
> > > > > > Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
> > > > > > page table entries? It seems like an accident to me that it works now.
> > > > > > Somebody forgot a pte_present() check.
> > > > >
> > > > > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > > > >
> > > >
> > > > The history of slow_virt_to_phys() doesn't show any discussion of
> > > > whether it is expected to work for non-present PTEs. However, the
> > > > page table walking is done by lookup_address(), which explicitly
> > > > *does* work for non-present PTEs. For example, lookup_address()
> > > > is used in cpa_flush() to find a PTE. cpa_flush() then checks to see if
> > > > the PTE is present.
> > >
> > > Which is a totally fine thing to do. The present bit is the only info you
> > > can always rely on being valid in a non-present PTE.
> > >
> > > But that is nitpicking on my side. Here you control the lifecycle of the
> > > PTE, so you know that the PFN will stay valid.
> > >
> > > I guess the right thing to do is to use lookup_address() and get the pfn
> > > from the PTE, with a comment explaining why it is valid.
> >
> > Each of the three implementations of the enc_status_change_prepare()/
> > enc_status_change_finish() callbacks needs to translate from vaddr to
> > pfn, and so would use lookup_address(). For convenience, I would
> > create a helper function that wraps lookup_address() and returns
> > a PFN. This helper function would be very similar to slow_virt_to_phys()
> > except for removing the shift to create a phys_addr_t from the PFN,
> > and adding back the offset within the page. Is that the approach you
> > had in mind?
>
> I guess in this case it is better to use slow_virt_to_phys(), but have a
> comment somewhere explaining why it is safe. Maybe also add a comment inside
> slow_virt_to_phys() to indicate that we don't check the present bit for a
> reason.
>
OK, works for me. I'll turn my original RFC patch into a small patch
set that's pretty much as the RFC version is coded, along with the
appropriate comments in slow_virt_to_phys(). Additional patches
in the set will remove code that becomes unnecessary or that can
be simplified.
Michael