Re: [PATCH Part2 v5 00/45] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

From: Marc Orr
Date: Tue Nov 16 2021 - 00:41:47 EST


On Mon, Nov 15, 2021 at 11:15 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> +arm64 KVM folks
>
> On Mon, Nov 15, 2021, Marc Orr wrote:
> > On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
> > > > * Sean Christopherson (seanjc@xxxxxxxxxx) wrote:
> > > > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > > > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > > > > > Or, is there some mechanism that prevent guest-private memory from being
> > > > > > > accessed in random host kernel code?
> > > > >
> > > > > Or random host userspace code...
> > > > >
> > > > > > So I'm currently under the impression that random host->guest accesses
> > > > > > should not happen if not previously agreed upon by both.
> > > > >
> > > > > Key word "should".
> > > > >
> > > > > > Because, as explained on IRC, if host touches a private guest page,
> > > > > > whatever the host does to that page, the next time the guest runs, it'll
> > > > > > get a #VC where it will see that that page doesn't belong to it anymore
> > > > > > and then, out of paranoia, it will simply terminate to protect itself.
> > > > > >
> > > > > > So cloud providers should have an interest to prevent such random stray
> > > > > > accesses if they wanna have guests. :)
> > > > >
> > > > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> > > >
> > > > Would it necessarily have been a host bug? A guest telling the host a
> > > > bad GPA to DMA into would trigger this wouldn't it?
> > >
> > > No, because as Andy pointed out, host userspace must already guard against a bad
> > > GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> > > that is completely bogus. The shared vs. private behavior just means that when
> > > host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> > > state of the GPA. If the host goes and DMAs into the completely wrong HVA=>PFN,
> > > then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> > > guest doesn't change the fact that the host messed up.
> >
> > "If the host goes and DMAs into the completely wrong HVA=>PFN, then
> > that is a host bug; that the bug happened to be exploited by a
> > buggy/malicious guest doesn't change the fact that the host messed
> > up."
> > ^^^
> > Again, I'm flabbergasted that you are arguing that it's OK for a guest
> > to exploit a host bug to take down host-side processes or the host
> > itself, either of which could bring down all other VMs on the machine.
> >
> > I'm going to repeat -- this is not OK! Period.
>
> Huh? At which point did I suggest it's ok to ship software with bugs? Of course
> it's not ok to introduce host bugs that let the guest crash the host (or host
> processes). But _if_ someone does ship buggy host software, it's not like we can
> wave a magic wand and stop the guest from exploiting the bug. That's why they're
> such a big deal.
>
> Yes, in this case a very specific flavor of host userspace bug could be morphed
> into a guest exception, but as mentioned ad nauseum, _if_ host userspace has bug
> where it does not properly validate a GPA=>HVA, then any such bug exists and is
> exploitable today irrespective of SNP.

If I'm understanding you correctly, you're saying that we will never
get into the host's page fault handler due to an RMP violation if we
implement the unmapping guest private memory proposal (without bugs).

However, bugs do happen. And the host-side page fault handler will
have code to react to an RMP violation (even if it's supposedly
impossible to hit). I'm saying that the host-side page fault handler
should NOT handle an RMP violation by killing host-side processes or
the kernel itself. This is detrimental to host reliability.

There are two ways to handle this. (1) Convert the private page
causing the RMP violation to shared, (2) Kill the guest.

Converting the private page to shared is a good solution in SNP's
threat model. And overall, it's better for debuggability than
immediately terminating the guest.

> > Again, if the community wants to layer some orchestration scheme
> > between host userspace, host kernel, and guest, on top of the code to
> > inject the #VC into the guest, that's fine. This proposal is not
> > stopping that. In fact, the two approaches are completely orthogonal
> > and compatible.
> >
> > But so far I have heard zero reasons why injecting a #VC into the
> > guest is wrong. Other than just stating that it's wrong.
>
> It creates a new attack surface, e.g. if the guest mishandles the #VC and does
> PVALIDATE on memory that it previously accepted, then userspace can attack the
> guest by accessing guest private memory to coerce the guest into consuming corrupted
> data.

We should handle RMP violations as best possible from within the
host-side page fault handler, independent of the proposal to unmap
private guest memory for all CVM architectures. Otherwise, if someone
figures out how to trigger an RMP violation by writing guest private
memory (despite unmapping guest private memory's goal to make this
impossible), the attack surface has now increased. Because now we're
either killing host processes or the kernel. Which is worse than
killing the single guest.

Second, I don't think it's correct to say that the host-side
implementation changes the attack surface. The guest must already be
hardened against host-side bugs and attacks. From the guest's
perspective, the attack surface is the same. Also, unmapping private
guest memory is going to require coordination across host userspace,
host kernel, and guest. That's a lot of code. And typically more code
means more attack surface. That being said, if all that code is
implemented perfectly, you might be right, that unmapping guest
private memory makes it harder for the host to attack the guest. But I
think it's a moot point. Because in the end, converting the page from
private to shared is entirely reasonable to solve this problem for
SNP, and we should be doing it irregardless, even if we do get
unmapping private memory working on SNP.