Re: [PATCH Part2 v5 00/45] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

From: Marc Orr
Date: Sun Nov 14 2021 - 02:42:00 EST


On Sat, Nov 13, 2021 at 10:28 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Nov 12, 2021, Marc Orr wrote:
> > On Fri, Nov 12, 2021 at 4:53 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > On Fri, Nov 12, 2021, Peter Gonda wrote:
> > > > Having a way for userspace to lock pages as shared was an idea I just
> > > > proposed the simplest solution to start the conversation.
> > >
> > > Assuming you meant that to read:
> > >
> > > Having a way for userspace to lock pages as shared is an alternative idea; I
> > > just proposed the simplest solution to start the conversation.
> > >
> > > The unmapping[*] guest private memory proposal is essentially that, a way for userspace
> > > to "lock" the state of a page by requiring all conversions to be initiated by userspace
> > > and by providing APIs to associate a pfn 1:1 with a KVM instance, i.e. lock a pfn to
> > > a guest.
> > >
> > > Andy's DMA example brings up a very good point though. If the shared and private
> > > variants of a given GPA are _not_ required to point at a single PFN, which is the
> > > case in the current unmapping proposal, userspace doesn't need to do any additional
> > > juggling to track guest conversions across multiple processes.
> > >
> > > Any process that's accessing guest (shared!) memory simply does its locking as normal,
> > > which as Andy pointed out, is needed for correctness today. If the guest requests a
> > > conversion from shared=>private without first ensuring the gfn is unused (by a host
> > > "device"), the host will side will continue accessing the old, shared memory, which it
> > > locked, while the guest will be doing who knows what. And if the guest provides a GPA
> > > that isn't mapped shared in the VMM's address space, it's conceptually no different
> > > than if the guest provided a completely bogus GPA, which again needs to be handled today.
> > >
> > > In other words, if done properly, differentiating private from shared shouldn't be a
> > > heavy lift for host userspace.
> > >
> > > [*] Actually unmapping memory may not be strictly necessary for SNP because a
> > > #PF(RMP) is likely just as good as a #PF(!PRESENT) when both are treated as
> > > fatal, but the rest of the proposal that allows KVM to understand the stage
> > > of a page and exit to userspace accordingly applies.
> >
> > Thanks for this explanation. When you write "while the guest will be
> > doing who knows what":
> >
> > Isn't that a large weakness of this proposal? To me, it seems better
> > for debuggability to corrupt the private memory (i.e., convert the
> > page to shared) so the guest can detect the issue via a PVALIDATE
> > failure.
>
> The behavior is no different than it is today for regular VMs.

Isn't this counter to the sketch you laid out earlier where you wrote:

--- QUOTE START ---
- if userspace accesses guest private memory, it gets SIGSEGV or whatever.
- if kernel accesses guest private memory, it does BUG/panic/oops[*]
- if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
--- QUOTE END ---

Here, the guest does not get killed. Which seems hard to debug.

> > The main issue I see with corrupting the guest memory is that we may
> > not know whether the host is at fault or the guest.
>
> Yes, one issue is that bugs in the host will result in downstream errors in the
> guest, as opposed to immediate, synchronous detection in the guest. IMO that is
> a significant flaw.

Nobody wants bugs in the host. Once we've hit one -- and we will --
we're in a bad situation. The question is how will we handle these
bugs. I'm arguing that we should design the system to be as robust as
possible.

I agree that immediate, synchronous detection in the guest is ideal.
But at what cost? Is it worth killing _ALL_ VMs on the system? That's
what we're doing when we crash host-wide processes, or even worse, the
kernel itself.

The reality is that guests must be able to detect writes to their
private memory long after they have occurred. Both SNP and TDX are
designed this way! For SNP the guest gets a PVALIDATE failure when it
reads the corrupted memory. For TDX, my understanding is that the
hardware fails to verify an HMAC over a page when it's read by the
guest.

The argument I'm making is that the success of confidential VMs -- in
both SNP and TDX -- depends on the guest being able to detect that its
private memory has been corrupted in a delayed and asynchronous
fashion. We should be leveraging this fact to make the entire system
more robust and reliable.

> Another issue is that the host kernel, which despite being "untrusted", absolutely
> should be acting in the best interests of the guest. Allowing userspace to inject
> #VC, e.g. to attempt to attack the guest by triggering a spurious PVALIDATE, means
> the kernel is failing miserably on that front.

The host is acting in the best interests of the guest. That's why
we're having this debate :-). No-one here is trying to write host code
to sabotage the guest. Quite the opposite.

But what happens when we mess up? That's what this conversation is really about.

If allowing userspace to inject #VC into the guest means that the host
can continue to serve other guests, that seems like a win. The
alternative, to blow up the host, essentially expands the blast radius
from a single guest to all guests.

Also, these attacks go both ways. As we already discussed, the guest
may try to trick the host into writing its own private memory. Yes,
the entire idea to literally make that impossible is a good one -- in
theory. But it's also very complicated. And what happens if we get
that wrong? Now our entire host is at risk from a single guest.

Leveraging the system-wide design of SNP and TDX -- where a detecting
writes to private memory asynchronously is table stakes -- increases
the reliability of the entire system. And, as Peter mentioned earlier
on, we can always incorporate any future work to make writing private
memory impossible into SNP, on top of the code to convert the shared
page to private. This way, we get reliability in depth, and minimize
the odds of crashing the host -- and its VMs.