Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework

From: Brijesh Singh
Date: Mon Oct 26 2020 - 23:18:04 EST


Hi Sean,

On 8/4/20 2:40 PM, Brijesh Singh wrote:
> On 8/3/20 12:16 PM, Sean Christopherson wrote:
>> On Mon, Aug 03, 2020 at 10:52:05AM -0500, Brijesh Singh wrote:
>>> Thanks for the series, Sean. Some thoughts:
>>>
>>>
>>> On 7/31/20 4:23 PM, Sean Christopherson wrote:
>>>> SEV currently needs to pin guest memory as it doesn't support migrating
>>>> encrypted pages. Introduce a framework in KVM's MMU to support pinning
>>>> pages on demand without requiring additional memory allocations, and with
>>>> (somewhat hazy) line of sight toward supporting more advanced features for
>>>> encrypted guest memory, e.g. host page migration.
>>> Eric's attempt at lazy pinning suffers from the memory allocation
>>> problem, and your series seems to address it. As you have noticed,
>>> the current SEV enablement in KVM does not support migrating
>>> encrypted pages. However, recent SEV firmware provides support for
>>> migrating encrypted pages (e.g. host page migration). The support is
>>> available in SEV FW >= 0.17.
>> I assume SEV also doesn't support ballooning? Ballooning would be a good
>> first step toward page migration as I think it'd be easier for KVM to
>> support, e.g. only needs to deal with the "zap" and not the "move".
>
> Yes, ballooning does not work with SEV.
>
>
>>>> The idea is to use a software-available bit in the SPTE to track that a
>>>> page has been pinned. The decision to pin a page and the actual pinning
>>>> management is handled by vendor code via kvm_x86_ops hooks. There are
>>>> intentionally two hooks (zap and unzap) introduced that are not needed for
>>>> SEV. I included them to again show how the flag (probably renamed?) could
>>>> be used for more than just pin/unpin.
>>> If using the software-available bits to track pinning is acceptable
>>> then it can be used for non-SEV guests as well (if needed). I will
>>> look through your patch more carefully, but one immediate question:
>>> when do we unpin the pages? In the case of SEV, once a page is
>>> pinned it should not be unpinned until the guest terminates. If we
>>> unpin a page before the VM terminates then there is a chance host
>>> page migration will kick in and move the page. The KVM MMU code may
>>> drop SPTEs during zap/unzap, which happens a lot during guest
>>> execution, and that would lead us to a path where vendor-specific
>>> code unpins pages while the guest is running, causing data
>>> corruption for the SEV guest.
>> The pages are unpinned by:
>>
>> drop_spte()
>>   -> rmap_remove()
>>        -> sev_drop_pinned_spte()
>>
>>
>> The intent is to allow unpinning pages when the mm_struct dies, i.e. when
>> the memory is no longer reachable (as opposed to when the last reference to
>> KVM is put), but typing that out, I realize there are dependencies and
>> assumptions that don't hold true for SEV as implemented.
>
> So, I tried this RFC with an SEV guest (of course after adding some of
> the stuff you highlighted below), and the guest fails randomly. I have
> seen two or three types of failures: 1) boot, 2) kernbench execution,
> and 3) device addition/removal; the failure signature is not
> consistent. I believe after addressing some of the dependencies we may
> be able to make some progress, but it will add new restrictions which
> did not exist before.
>
>> - Parent shadow pages won't be zapped. Recycling MMU pages and zapping
>> all SPs due to memslot updates are the two concerns.
>>
>> The easy way out for recycling is to not recycle SPs with pinned
>> children, though that may or may not fly with VMM admins.
>>
>> I'm trying to resolve the memslot issue[*], but confirming that there's
>> no longer an issue with not zapping everything is proving difficult as
>> we haven't yet reproduced the original bug.
>>
>> - drop_large_spte() won't be invoked. I believe the only semi-legitimate
>> scenario is if the NX huge page workaround is toggled on while a VM is
>> running. Disallowing that if there is an SEV guest seems reasonable?
>>
>> There might be an issue with the host page size changing, but I don't
>> think that can happen if the page is pinned. That needs more
>> investigation.
>>
>>
>> [*] https://lkml.kernel.org/r/20200703025047.13987-1-sean.j.christopherson@intel.com


We would like to pin the guest memory on #NPF to reduce boot delay for
the SEV guest. Are you planning to proceed with this RFC? With some
fixes, I am able to get the RFC working for an SEV guest; I can share
those fixes with you so that you can include them in the next revision.
One of the main roadblocks I see is that the proposed framework has a
dependency on the memslot patch you mentioned above. Without the
memslot patch we will end up dropping (i.e. unpinning) SPTEs during
memslot updates, which is not acceptable for an SEV guest. I don't see
any resolution on the memslot patch yet; any updates are appreciated. I
understand that getting the memslot issue resolved may be difficult, so
I am wondering if, in the meantime, we should proceed with the xarray
approach to track the pinned pages and release them on VM termination.


>>>> Bugs in the core implementation are pretty much guaranteed. The basic
>>>> concept has been tested, but in a fairly different incarnation. Most
>>>> notably, tagging PRESENT SPTEs as PINNED has not been tested, although
>>>> using the PINNED flag to track zapped (and known to be pinned) SPTEs has
>>>> been tested. I cobbled this variation together fairly quickly to get the
>>>> code out there for discussion.
>>>>
>>>> The last patch to pin SEV pages during sev_launch_update_data() is
>>>> incomplete; it's there to show how we might leverage MMU-based pinning to
>>>> support pinning pages before the guest is live.
>>> I will add the SEV-specific bits and give this a try.
>>>
>>>> Sean Christopherson (8):
>>>> KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
>>>> KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
>>>> KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
>>>> KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
>>>> KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
>>>> KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
>>>> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
>>>> KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
>>>>
>>>> arch/x86/include/asm/kvm_host.h | 7 ++
>>>> arch/x86/kvm/mmu.h | 3 +
>>>> arch/x86/kvm/mmu/mmu.c | 186 +++++++++++++++++++++++++-------
>>>> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
>>>> arch/x86/kvm/svm/sev.c | 141 +++++++++++++++++++++++-
>>>> arch/x86/kvm/svm/svm.c | 3 +
>>>> arch/x86/kvm/svm/svm.h | 3 +
>>>> 7 files changed, 302 insertions(+), 44 deletions(-)
>>>>