RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared

From: Michael Kelley (LINUX)
Date: Mon Aug 28 2023 - 10:23:13 EST


From: Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx> Sent: Tuesday, August 15, 2023 7:54 PM
>
> From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Sunday,
> August 6, 2023 3:20 PM
> >
> > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > In a CoCo VM when a page transitions from private to shared, or vice
> > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > be notified of the change. Because there are two separate steps, there's
> > > a window where the settings are inconsistent. Normally the code that
> > > initiates the transition (via set_memory_decrypted() or
> > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > during a transition, so the window of inconsistency is not a problem.
> > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > pages at arbitrary times, which could access a transitioning page during
> > > the window. In such a case, CoCo VM specific exceptions are taken
> > > (depending on the CoCo architecture in use). Current code in those
> > > exception handlers recovers and does "fixup" on the result returned by
> > > load_unaligned_zeropad(). Unfortunately, this exception handling and
> > > fixup code is tricky and somewhat fragile. At the moment, it is
> > > broken for both TDX and SEV-SNP.
> >
>
> Thanks for looking at this. I'm finally catching up after being out on
> vacation for a week.
>
> > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
>
> I presume you meant "now fixed for TDX", which I agree with. Tom
> Lendacky has indicated that there's still a problem with SEV-SNP. He
> could fix that problem, but this approach of marking the pages
> invalid obviates the need for Tom's fix.
>
> >
> > > There's also a problem with the current code in paravisor scenarios:
> > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > to be forwarded from the paravisor to the Linux guest, but there
> > > are no architectural specs for how to do that.
>
> The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> what doesn't have a solution. To elaborate, with TD Partitioning, #VE
> is sent to the containing VM, not the main Linux guest VM. For
> everything except an EPT violation, the containing VM can handle
> the exception on behalf of the main Linux guest by doing the
> appropriate emulation. But for an EPT violation, the containing
> VM can only terminate the guest. It doesn't have sufficient context
> to handle a "valid" #VE with EPT violation generated due to
> load_unaligned_zeropad(). My proposed scheme of marking the
> pages invalid avoids generating those #VEs and lets TD Partitioning
> (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
>
> > >
> > > To avoid these complexities of the CoCo exception handlers, change
> > > the core transition code in __set_memory_enc_pgtable() to do the
> > > following:
> > >
> > > 1. Remove aliasing mappings
> > > 2. Remove the PRESENT bit from the PTEs of all transitioning pages
> > > 3. Flush the TLB globally
> > > 4. Flush the data cache if needed
> > > 5. Set/clear the encryption attribute as appropriate
> > > 6. Notify the hypervisor of the page status change
> > > 7. Add back the PRESENT bit
> >
> > Okay, looks safe.
> >
> > > With this approach, load_unaligned_zeropad() just takes its normal
> > > page-fault-based fixup path if it touches a page that is transitioning.
> > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > are completely decoupled. CoCo VM page transitions can proceed
> > > without needing to handle architecture-specific exceptions and fix
> > > things up. This decoupling reduces the complexity due to separate
> > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > architectures. Paravisor scenarios work properly without needing
> > > to forward exceptions.
> > >
> > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > because of touching the PTEs three times instead of just once. But
> > > the run time of this function is already dominated by the hypercall
> > > and the need to flush the TLB at least once and maybe twice. In any
> > > case, this function is only used for CoCo VM page transitions, and
> > > is already unsuitable for hot paths.
> > >
> > > The architecture specific callback function for notifying the
> > > hypervisor typically must translate guest kernel virtual addresses
> > > into guest physical addresses to pass to the hypervisor. Because
> > > the PTEs are invalid at the time of callback, the code for doing the
> > > translation needs updating. virt_to_phys() or equivalent continues
> > > to work for direct map addresses. But vmalloc addresses cannot use
> > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > manually walk the page table hierarchy, so performance is the same.
> >
> > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > page table entries? It seems accident for me that it works now. Somebody
> > forgot pte_present() check.
>
> I didn't research the history of slow_virt_to_phys(), but I'll do so.
>

The history of slow_virt_to_phys() doesn't show any discussion of
whether it is expected to work for non-present PTEs. However, the
page table walking is done by lookup_address(), which explicitly
*does* work for non-present PTEs. For example, lookup_address()
is used in cpa_flush() to find a PTE. cpa_flush() then checks to see if
the PTE is present. The comparable vmalloc_to_page() *does* require
the PTE to be present, but arguably that's because it is returning a
struct page * rather than a phys_addr_t.

So I don't know that the pte_present() check was forgotten in
slow_virt_to_phys(). FWIW, Dave Hansen originally added
slow_virt_to_phys() about 10 years ago. :-)

> >
> > Generally, if present bit is clear we cannot really assume anything about
> > the rest of the bits without external hints.
> >
> > This smells bad.
>
> Maybe so. But if we've just cleared the present bit, then we really
> do know something about the rest of the bits. We could either
> document a constraint that slow_virt_to_phys() should not assume
> the present bit, or write a separate but equivalent function that
> doesn't assume the present bit.
>
> >
> > Maybe the interface has to be reworked to operate on GPAs?
>
> I'll experiment and see what that might look like, and if it is
> better than doing the page table walk after the present bit
> is cleared.

Converting the enc_status_change_prepare() and
enc_status_change_finish() callbacks to use GPAs instead of
virtual addresses is doable, but makes the code more complex.
The detection of a vmalloc address must be done in
__set_memory_enc_pgtable() where the private<->shared
conversion is done, with looping over the individual pages. This
is what Dexuan Cui's patch [1] for the TDX implementation of the
callbacks does to handle vmalloc addresses.

There are three implementations of the callbacks: TDX, SEV-SNP,
and SEV-SNP + vTOM (for Hyper-V). Changing the TDX callbacks to use
GPAs is easy. Changing the SEV-SNP and SEV-SNP + vTOM callbacks is
messier because in both cases the underlying hypercalls support
supplying a batch of multiple GPAs that aren't necessarily contiguous
(TDX only supports a single contiguous range). Batching helps the
performance of the vmalloc case, but the top-level loop in
__set_memory_enc_pgtable() must provide a batch of GPAs instead
of doing them one-at-a-time. The callback implementations must also
loop because the batch size they support might be different from
what __set_memory_enc_pgtable() supplies.

All of that is doable. But adding the complexity doesn't seem like
the right tradeoff vs. just documenting that slow_virt_to_phys() is
expected to work even if the PTE is not present, especially since
lookup_address() already has that requirement. And when
we're doing private<->shared conversions, we really do know the
state of the PTEs even after the present bit is cleared.

If it would help to see the actual code for the callbacks to use
GPAs instead of virtual addresses, I'll do that work. But I'm
hoping just using slow_virt_to_phys() will be judged to be the
better solution.

Michael

[1] https://lore.kernel.org/lkml/20230811214826.9609-3-decui@xxxxxxxxxxxxx/

> >
> >
> > > Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx>
> > > ---
> > >
> > > I'm assuming the TDX handling of the data cache flushing is the
> > > same with this new approach, and that it doesn't need to be paired
> > > with a TLB flush as in the current code. If that's not a correct
> > > assumption, let me know.
>
> Kirill -- could you comment on the above? I don't fully understand
> the TDX scenarios that need data cache flushing.
>
> > >
> > > I've left the two hypervisor callbacks, before and after Step 5
> > > above. If the PTEs are invalid, it seems like the order of Step 5
> > > and Step 6 shouldn't matter, so perhaps one of the callback could
> > > be dropped. Let me know if there are reasons to do otherwise.
> > >
> > > It may well be possible to optimize the new implementation of
> > > __set_memory_enc_pgtable(). The existing set_memory_np() and
> > > set_memory_p() functions do all the right things in a very clear
> > > fashion, but perhaps not as optimally as having all three PTE
> > > manipulations directly in the same function. It doesn't appear
> > > that optimizing the performance really matters here, but I'm open
> > > to suggestions.
> > >
> > > I've tested this on TDX VMs and SEV-SNP + vTOM VMs. I can also
> > > test on SEV-SNP VMs without vTOM. But I'll probably need help
> > > covering SEV and SEV-ES VMs.
> > >
> > > This RFC patch does *not* remove code that would no longer be
> > > needed. If there's agreement to take this new approach, I'll
> > > add patches to remove the unneeded code.
> > >
> > > This patch is built against linux-next20230704.
> > >
> > > arch/x86/hyperv/ivm.c | 3 ++-
> > > arch/x86/kernel/sev.c | 2 +-
> > > arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
> > > 3 files changed, 15 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> > > index 28be6df..2859ec3 100644
> > > --- a/arch/x86/hyperv/ivm.c
> > > +++ b/arch/x86/hyperv/ivm.c
> > > @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long
> kbuffer, int pagecount, bo
> > > return false;
> > >
> > > for (i = 0, pfn = 0; i < pagecount; i++) {
> > > - pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i *
> HV_HYP_PAGE_SIZE);
> > > + pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> > > + i * HV_HYP_PAGE_SIZE) >>
> HV_HYP_PAGE_SHIFT;
> > > pfn++;
> > >
> > > if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> > > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> > > index 1ee7bed..59db55e 100644
> > > --- a/arch/x86/kernel/sev.c
> > > +++ b/arch/x86/kernel/sev.c
> > > @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc
> *data, unsigned long
> > > hdr->end_entry = i;
> > >
> > > if (is_vmalloc_addr((void *)vaddr)) {
> > > - pfn = vmalloc_to_pfn((void *)vaddr);
> > > + pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
> > > use_large_entry = false;
> > > } else {
> > > pfn = __pa(vaddr) >> PAGE_SHIFT;
> > > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > > index bda9f12..8a194c7 100644
> > > --- a/arch/x86/mm/pat/set_memory.c
> > > +++ b/arch/x86/mm/pat/set_memory.c
> > > @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long
> addr, int numpages, bool enc)
> > > if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> > > addr &= PAGE_MASK;
> > >
> > > + /* set_memory_np() removes aliasing mappings and flushes the TLB */
> > > + ret = set_memory_np(addr, numpages);
> > > + if (ret)
> > > + return ret;
> > > +
> > > memset(&cpa, 0, sizeof(cpa));
> > > cpa.vaddr = &addr;
> > > cpa.numpages = numpages;
> > > @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long
> addr, int numpages, bool enc)
> > > cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> > > cpa.pgd = init_mm.pgd;
> > >
> > > - /* Must avoid aliasing mappings in the highmem code */
> > > - kmap_flush_unused();
> > > - vm_unmap_aliases();
> >
> >
> > Why did you drop this?
>
> Both functions are already called by set_memory_np() ->
> change_page_attr_clear() -> change_page_attr_set_clr().
>
> >
> > > -
> > > /* Flush the caches as needed before changing the encryption attribute. */
> > > - if (x86_platform.guest.enc_tlb_flush_required(enc))
> > > - cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> > > + if (x86_platform.guest.enc_cache_flush_required())
> > > + cpa_flush(&cpa, 1);
> >
> > Do you remove the last enc_cache_flush_required() user?
>
> Yes, I think so. As I commented above, this RFC version doesn't remove
> all the unneeded code. If there's agreement to move forward, I'll do that.
>
> >
> > > /* Notify hypervisor that we are about to set/clr encryption attribute. */
> > > if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> > > return -EIO;
> > >
> > > ret = __change_page_attr_set_clr(&cpa, 1);
> > > -
> > > - /*
> > > - * After changing the encryption attribute, we need to flush TLBs again
> > > - * in case any speculative TLB caching occurred (but no need to flush
> > > - * caches again). We could just use cpa_flush_all(), but in case TLB
> > > - * flushing gets optimized in the cpa_flush() path use the same logic
> > > - * as above.
> > > - */
> > > - cpa_flush(&cpa, 0);
> > > + if (ret)
> > > + return ret;
> > >
> > > /* Notify hypervisor that we have successfully set/clr encryption attribute. */
> > > - if (!ret) {
> > > - if (!x86_platform.guest.enc_status_change_finish(addr, numpages,
> enc))
> > > - ret = -EIO;
> > > - }
> > > + if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > > + return -EIO;
> > >
> > > - return ret;
> > > + return set_memory_p(&addr, numpages);
> > > }
> > >
> > > static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> > > --
> > > 1.8.3.1
> > >
> >
> > --
> > Kiryl Shutsemau / Kirill A. Shutemov