Re: [PATCH Part2 v5 39/45] KVM: SVM: Introduce ops for the post gfn map and unmap

From: Michael Roth
Date: Thu Sep 08 2022 - 18:29:11 EST


On Thu, Sep 08, 2022 at 04:21:14PM -0500, Michael Roth wrote:
> On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 15, 2021, Brijesh Singh wrote:
> > >
> > > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > > >>> either be a private or shared. A write from the hypervisor goes through
> > > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > > >>> to a guest private page, then it triggers an RMP violation #PF.
> > > >>>
> > > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > > >>> protect against the page state change for existing mapped pages.
> > > >> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> > > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > > >> from being created while the actual RMP updates are in-flight. Most telling is
> > > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > > Argh, another goof on my part. Rereading prior feedback, I see that I loosely
> > > > suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
> > > > (hope) I made it offhand without really thinking it through. SRCU can't work in
> > > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > > multiple copies of the data. That simply can't be true for the RMP as hardware
> > > > operates on a single table.
> > > >
> > > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > > especially those that are made without concrete examples, i.e. are likely off the
> > > > cuff. My goal isn't to set you up for failure :-/
> > >
> > > What do you think about going back to my initial proposal of per-gfn
> > > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > > and let the copy_to_user() take a fault and return an error (if it
> > > attempt to write to guest private). If PSC happen while lock is held
> > > then simplify return and let the guest retry PSC.
> >
> > That approach is also broken as it doesn't hold a lock when updating host_write_track,
> > e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> > in-progress readers.
> >
> > I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> > to spend time iterating on the KVM case until consensus has been reached on how
> > exactly RMP updates will be handled, and in general how the kernel will manage
> > guest private memory.
>
> Hi Sean,
>
> (Sorry in advance for the long read, but it touches on a couple
> inter-tangled topics that I think are important for this series.)
>
> While we do still remain committed to working with community toward
> implementing a UPM solution, we would still like to have a 'complete'
> solution in the meantime, if at the very least to use as a basis for
> building out other parts of the stack, or for enabling early adopters
> downstream doing the same.
>
> Toward that end, there are a couple areas that need to be addressed,
> (and remain unaddressed with v6, since we were planning to leverage UPM
> to handle them and so left them as open TODOs at the time):
>
> 1) this issue of guarding against shared->private conversions for pages
> that are in-use by kernel
> 2) how to deal with unmapping/splitting the direct map
>
> (These 2 things end up being fairly tightly coupled, which is why I'm
> trying to cover them both in this email. Will explain more in a bit)
>
> You mentioned elsewhere how 1) would be nicely addressed with UPM, since
> the conversion would only point the guest to an entirely new page, while
> leaving the shared page intact (at least until the normal notifiers do
> their thing in response to any subsequent discard operations from
> userspace). If the guest breaks because it doesn't see the write, that's
> okay, because it's not supposed to be switching in-use shared pages to
> private while they are being used. (though the pKVM folks did note in v7
> of private memslot patches that they actually use the same physical page
> for both shared/private, so I'm not sure if that approach will still
> stack up in that context, if it ends up being needed there)
>
> So in the context of this interim solution, we're trying to look for a
> solution that's simple enough that it can be used reliably, without
> introducing too much additional complexity into KVM. There is one
> approach that seems to fit that bill, that Brijesh attempted in an
> earlier version of this series (I'm not sure what exactly was the
> catalyst to changing the approach, as I wasn't really in the loop at
> the time, but AIUI there weren't any showstoppers there, but please
> correct me if I'm missing anything):
>
> - if the host is writing to a page that it thinks is supposed to be
> shared, and the guest switches it to private, we get an RMP fault
> (actually, we will get a !PRESENT fault, since as of v5 we now
> remove the mapping from the directmap as part of conversion)
> - in the host #PF handler, if we see that the page is marked private
> in the RMP table, simply switch it back to shared
> - if this was a bug on the part of the host, then the guest will see
> the validated bit was cleared, and get a #VC where it can
> terminate accordingly
> - if this was a bug (or maliciousness) on the part of the guest,
> then the behavior is unexpected anyway, and it will be made
> aware this by the same mechanism
>
> We've done some testing with this approach and it seems to hold up,
> but since we also need the handler to deal with the !PRESENT case
> (because of the current approach of nuking directmap entry for
> private pages), there's a situation that can't be easily resolved
> with this approach:
>
> # CASE1: recoverable
>
> THREAD 1 THREAD 2
> - #PF(!PRESENT) - #PF(!PRESENT)
> - restore directmap
> - RMPUPDATE(shared)
> - not private? okay to retry since maybe it was
> fixed up already
>
> # CASE2: unrecoverable
>
> THREAD 1
> - broken kernel breaks directmap/init-mm, not related to SEV
> - #PF(!PRESENT)
> - not private? we should oops, since retrying might cause a loop
>
> The reason we can't distinguish between recoverable/unrecoverable is
> because the kernel #PF handling here needs to happen for both !PRESENT
> and for RMP violations. This is due to how we unmap from directmap as
> part of shared->private conversion.
>
> So we have to take the pessimistic approach due to CASE2, which
> means that if CASE1 pops up, we'll still have a case where guest can
> cause a crash/loop.
>
> That where 2) comes into play. If we go back to the original approach
> (as of v4) of simply splitting the directmap when we do shared->private
> conversions, then we can limit the #PF handling to only have to deal
> with cases where there's an RMP violation and X86_PF_RMP is set, which
> makes it safe to retry to handle spurious cases like case #1.
>
> AIUI, this is still sort of an open question, but you noted how nuking
> the directmap without any formalized interface for letting the kernel
> know about it could be problematic down the road, which also sounds
> like the sort of thing more suited for having UPM address at a more
> general level, since there are similar requirements for TDX as well.
>
> AIUI there are 2 main arguments against splitting the directmap:
> a) we can't easily rebuild it atm
> b) things like KSM might still tries to access private pages
>
> But unmapping also suffers from a), since we still end up splitting the
> directmap unless we remove pages in blocks of 2M. But nothing prevents
> a guest from switching a single 4K page to private, in which case we
> are forced to split. That would be normal behavior on the part of the
> guest for setting up GHCB pages/etc, so we still end up splitting the
> directmap over time.

One more thing to note here, is with the splitting approach, we also
have the option of doing all the splitting when the region is registered
via KVM_MEMORY_ENCRYPT_REG_REGION. If unsplitting becomes possible in
the future, we could then handle that in KVM_MEMORY_ENCRYPT_UNREG_REGION
all at once which, where we'd have a bit more flexibility for going
about that.

-Mike

>
> And for b), as you noted, this is already something that SEV/SEV-ES
> need to deal with, and at least in the case of SNP guest things are
> a bit better since:
>
> - if some kernel thread tried to write to private memory, they
> would notice if some errant kernel thread tried to write to guest
> memory, since #PF handler with flip the page and unset the
> validated bit on that memory.
>
> - for reads, they will get ciphertext, which isn't any worse than
> reading unencrypted guest memory for anything outside of KVM that
> specifically knows what it is expecting to read there (KSM being
> one exception, since it'll waste cycles cmp'ing ciphertext, but
> there's no way to make it work anyway, and we would be fine with
> requiring it to be disabled if that is a concern)
>
> So that's sort of the context for why we're considering these 2 changes.
> Any input on these is welcome, but at the very least wanted to provide
> our rationale for past reviewers.
>
> Thanks!
>
> -Mike