Re: [PATCH Part2 v5 00/45] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

From: Brijesh Singh
Date: Mon Nov 22 2021 - 10:23:46 EST

Hi Peter,

On 11/12/21 9:43 AM, Peter Gonda wrote:
Hi Brijesh,,

One high level discussion I'd like to have on these SNP KVM patches.

In these patches (V5) if a host userspace process writes a guest
private page a SIGBUS is issued to that process. If the kernel writes
a guest private page then the kernel panics due to the unhandled RMP
fault page fault. This is an issue because not all writes into guest
memory may come from a bug in the host. For instance a malicious or
even buggy guest could easily point the host to writing a private page
during the emulation of many virtual devices (virtio, NVMe, etc). For
example if a well behaved guests behavior is to: start up a driver,
select some pages to share with the guest, ask the host to convert
them to shared, then use those pages for virtual device DMA, if a
buggy guest forget the step to request the pages be converted to
shared its easy to see how the host could rightfully write to private
memory. I think we can better guarantee host reliability when running
SNP guests without changing SNP’s security properties.

Here is an alternative to the current approach: On RMP violation (host
or userspace) the page fault handler converts the page from private to
shared to allow the write to continue. This pulls from s390’s error
handling which does exactly this. See ‘arch_make_page_accessible()’.
Additionally it adds less complexity to the SNP kernel patches, and
requires no new ABI.

In the current (V5) KVM implementation if a userspace process
generates an RMP violation (writes to guest private memory) the
process receives a SIGBUS. At first glance, it would appear that
user-space shouldn’t write to private memory. However, guaranteeing
this in a generic fashion requires locking the RMP entries (via locks
external to the RMP). Otherwise, a user-space process emulating a
guest device IO may be vulnerable to having the guest memory
(maliciously or by guest bug) converted to private while user-space
emulation is happening. This results in a well behaved userspace
process receiving a SIGBUS.

This proposal allows buggy and malicious guests to run under SNP
without jeopardizing the reliability / safety of host processes. This
is very important to a cloud service provider (CSP) since it’s common
to have host wide daemons that write/read all guests, i.e. a single
process could manage the networking for all VMs on the host. Crashing
that singleton process kills networking for all VMs on the system.

Thank you for starting the thread; based on the discussion, I am keeping the current implementation as-is and *not* going with the auto conversion from private to shared. To summarize what we are doing in the current SNP series:

- If userspace accesses guest private memory, it gets SIGBUS.
- If kernel accesses[*] guest private memory, it does panic.

[*] Kernel consults the RMP table for the page ownership before the access. If the page is shared, then it uses the locking mechanism to ensure that a guest will not be able to change the page ownership while kernel has it mapped.


This proposal also allows for minimal changes to the kexec flow and
kdump. The new kexec kernel can simply update private pages to shared
as it encounters them during their boot. This avoids needing to
propagate the RMP state from kernel to kernel. Of course this doesn’t
preserve any running VMs but is still useful for kdump crash dumps or
quicker rekerneling for development with kexec.

This proposal does cause guest memory corruption for some bugs but one
of SEV-SNP’s goals extended from SEV-ES’s goals is for guest’s to be
able to detect when its memory has been corrupted / replayed by the
host. So SNP already has features for allowing guests to detect this
kind of memory corruption. Additionally this is very similar to a page
of memory generating a machine check because of 2-bit memory
corruption. In other words SNP guests must be enlightened and ready
for these kinds of errors.

For an SNP guest running under this proposal the flow would look like this:
* Host gets a #PF because its trying to write to a private page.
* Host #PF handler updates the page to shared.
* Write continues normally.
* Guest accesses memory (r/w).
* Guest gets a #VC error because the page is not PVALIDATED
* Guest is now in control. Guest can terminate because its memory has
been corrupted. Guest could try and continue to log the error to its

A similar approach was introduced in the SNP patches V1 and V2 for
kernel page fault handling. The pushback around this convert to shared
approach was largely focused around the idea that the kernel has all
the information about which pages are shared vs private so it should
be able to check shared status before write to pages. After V2 the
patches were updated to not have a kernel page fault handler for RMP
violations (other than dumping state during a panic). The current
patches protect the host with new post_{map,unmap}_gfn() function that
checks if a page is shared before mapping it, then locks the page
shared until unmapped. Given the discussions on ‘[Part2,v5,39/45] KVM:
SVM: Introduce ops for the post gfn map and unmap’ building a solution
to do this is non trivial and adds new overheads to KVM. Additionally
the current solution is local to the kernel. So a new ABI just now be
created to allow the userspace VMM to access the kernel-side locks for
this to work generically for the whole host. This is more complicated
than this proposal and adding more lock holders seems like it could
reduce performance further.

There are a couple corner cases with this approach. Under SNP guests
can request their memory be changed into a VMSA. This VMSA page cannot
be changed to shared while the vCPU associated with it is running. So
KVM + the #PF handler will need something to kick vCPUs from running.
Joerg believes that a possible fix for this could be a new MMU
notifier in the kernel, then on the #PF we can go through the rmp and
execute this vCPU kick callback.

Another corner case is the RMPUPDATE instruction is not guaranteed to
succeed on first iteration. As noted above if the page is a VMSA it
cannot be updated while the vCPU is running. Another issue is if the
guest is running a RMPADJUST on a page it cannot be RMPUPDATED at that
time. There is a lock for each RMP Entry so there is a race for these
instructions. The vCPU kicking can solve this issue to be kicking all
guest vCPUs which removes the chance for the race.

Since this proposal probably results in SNP guests terminating due to
a page unexpectedly needing PVALIDATE. The approach could be
simplified to just the KVM killing the guest. I think it's nicer to
users to instead of unilaterally killing the guest allowing the
unvalidated #VC exception to allow users to collect some additional
debug information and any additional clean up work they would like to


On Fri, Aug 20, 2021 at 9:59 AM Brijesh Singh <brijesh.singh@xxxxxxx> wrote:

This part of the Secure Encrypted Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
SEV-SNP Part-1.

This series provides the basic building blocks to support booting the SEV-SNP
VMs, it does not cover all the security enhancement introduced by the SEV-SNP
such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and managed the SEV-SNP guests.

The GHCB specification version 2 introduces new set of NAE's that is
used by the SEV-SNP guest to communicate with the hypervisor. The series
provides support to handle the following new NAE events:
- Register GHCB GPA
- Page State Change Request
- Hypevisor feature
- Guest message request

The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, the read accesses from the
hypervisor do not require RMP checks because the data confidentiality is
already protected via memory encryption. When hardware encounters an RMP
checks failure, it raises a page-fault exception. If RMP check failure
is due to the page-size mismatch, then split the large page to resolve
the fault.

The series does not provide support for the interrupt security and migration
and those feature will be added after the base support.

The series is based on the commit:
SNP part1 commit and
fa7a549d321a (kvm/next, next) KVM: x86: accept userspace interrupt only if no event is injected

* Add support for command to ratelimit the guest message request.

Changes since v4:
* Move the RMP entry definition to x86 specific header file.
* Move the dump RMP entry function to SEV specific file.
* Use BIT_ULL while defining the #PF bit fields.
* Add helper function to check the IOMMU support for SEV-SNP feature.
* Add helper functions for the page state transition.
* Map and unmap the pages from the direct map after page is added or
removed in RMP table.
* Enforce the minimum SEV-SNP firmware version.
* Extend the LAUNCH_UPDATE to accept the base_gfn and remove the
logic to calculate the gfn from the hva.
* Add a check in LAUNCH_UPDATE to ensure that all the pages are
shared before calling the PSP.
* Mark the memory failure when failing to remove the page from the
RMP table or clearing the immutable bit.
* Exclude the encrypted hva range from the KSM.
* Remove the gfn tracking during the kvm_gfn_map() and use SRCU to
syncronize the PSC and gfn mapping.
* Allow PSC on the registered hva range only.
* Add support for the Preferred GPA VMGEXIT.
* Simplify the PSC handling routines.
* Use the static_call() for the newly added kvm_x86_ops.
* Remove the long-lived GHCB map.
* Move the snp enable module parameter to the end of the file.
* Remove the kvm_x86_op for the RMP fault handling. Call the
fault handler directly from the #NPF interception.

Changes since v3:
* Add support for extended guest message request.
* Add ioctl to query the SNP Platform status.
* Add ioctl to get and set the SNP config.
* Add check to verify that memory reserved for the RMP covers the full system RAM.
* Start the SNP specific commands from 256 instead of 255.
* Multiple cleanup and fixes based on the review feedback.

Changes since v2:
* Add AP creation support.
* Drop the patch to handle the RMP fault for the kernel address.
* Add functions to track the write access from the hypervisor.
* Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
* Dump the RMP entry on RMP violation for the debug.
* Shorten the GHCB macro names.
* Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
* Sync the header with the latest 0.9 SNP spec.

Changes since v1:
* Add AP reset MSR protocol VMGEXIT NAE.
* Add Hypervisor features VMGEXIT NAE.
* Move the RMP table initialization and RMPUPDATE/PSMASH helper in
* Add support to map/unmap SEV legacy command buffer to firmware state when
SNP is active.
* Enhance PSP driver to provide helper to allocate/free memory used for the
firmware context page.
* Add support to handle RMP fault for the kernel address.
* Add support to handle GUEST_REQUEST NAE event for attestation.
* Rename RMP table lookup helper.
* Drop typedef from rmpentry struct definition.
* Drop SNP static key and use cpu_feature_enabled() to check whether SEV-SNP
is active.
* Multiple cleanup/fixes to address Boris review feedback.

Brijesh Singh (40):
x86/cpufeatures: Add SEV-SNP CPU feature
iommu/amd: Introduce function to check SEV-SNP support
x86/sev: Add the host SEV-SNP initialization support
x86/sev: Add RMP entry lookup helpers
x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
x86/sev: Invalid pages from direct map when adding it to RMP table
x86/traps: Define RMP violation #PF error code
x86/fault: Add support to handle the RMP fault for user address
x86/fault: Add support to dump RMP entry on fault
crypto: ccp: shutdown SEV firmware on kexec
crypto:ccp: Define the SEV-SNP commands
crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
crypto:ccp: Provide APIs to issue SEV-SNP commands
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
crypto: ccp: Handle the legacy SEV command when SNP is enabled
crypto: ccp: Add the SNP_PLATFORM_STATUS command
crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
crypto: ccp: Provide APIs to query extended attestation report
KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
KVM: SVM: Add initial SEV-SNP support
KVM: SVM: Add KVM_SNP_INIT command
KVM: SVM: Mark the private vma unmerable for SEV-SNP guests
KVM: X86: Keep the NPT and RMP page level in sync
KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
KVM: x86: Define RMP page fault error bits for #NPF
KVM: x86: Update page-fault trace to log full 64-bit error code
KVM: SVM: Do not use long-lived GHCB map while setting scratch area
KVM: SVM: Remove the long-lived GHCB host map
KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
KVM: SVM: Add support to handle Page State Change VMGEXIT
KVM: SVM: Introduce ops for the post gfn map and unmap
KVM: x86: Export the kvm_zap_gfn_range() for the SNP use
KVM: SVM: Add support to handle the RMP nested page fault
KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
KVM: SVM: Add module parameter to enable the SEV-SNP

Sean Christopherson (2):
KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP

Tom Lendacky (3):
KVM: SVM: Add support to handle AP reset MSR protocol
KVM: SVM: Use a VMSA physical address variable for populating VMCB
KVM: SVM: Support SEV-SNP AP Creation NAE event

Documentation/virt/coco/sevguest.rst | 55 +
.../virt/kvm/amd-memory-encryption.rst | 102 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 20 +
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/include/asm/sev-common.h | 28 +
arch/x86/include/asm/sev.h | 45 +
arch/x86/include/asm/svm.h | 7 +
arch/x86/include/asm/trap_pf.h | 18 +-
arch/x86/kernel/cpu/amd.c | 3 +-
arch/x86/kernel/sev.c | 361 ++++
arch/x86/kvm/lapic.c | 5 +-
arch/x86/kvm/mmu.h | 7 +-
arch/x86/kvm/mmu/mmu.c | 84 +-
arch/x86/kvm/svm/sev.c | 1676 ++++++++++++++++-
arch/x86/kvm/svm/svm.c | 62 +-
arch/x86/kvm/svm/svm.h | 74 +-
arch/x86/kvm/trace.h | 40 +-
arch/x86/kvm/x86.c | 92 +-
arch/x86/mm/fault.c | 84 +-
drivers/crypto/ccp/sev-dev.c | 924 ++++++++-
drivers/crypto/ccp/sev-dev.h | 17 +
drivers/crypto/ccp/sp-pci.c | 12 +
drivers/iommu/amd/init.c | 30 +
include/linux/iommu.h | 9 +
include/linux/mm.h | 6 +-
include/linux/psp-sev.h | 346 ++++
include/linux/sev.h | 32 +
include/uapi/linux/kvm.h | 56 +
include/uapi/linux/psp-sev.h | 60 +
mm/memory.c | 13 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
34 files changed, 4088 insertions(+), 201 deletions(-)
create mode 100644 include/linux/sev.h