Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

From: Brijesh Singh
Date: Mon Apr 19 2021 - 17:25:24 EST



On 4/19/21 1:10 PM, Andy Lutomirski wrote:
>
>> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>>
>> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>>> - guest wants to make gpa 0x1000 as a shared page. To support this, we
>>> need to psmash the large RMP entry into 512 4K entries. The psmash
>>> instruction breaks the large RMP entry into 512 4K entries without
>>> affecting the previous validation. Now we need to force the host to
>>> use the 4K page level instead of the 2MB.
>>>
>>> To my understanding, Linux kernel fault handler does not build the page
>>> tables on demand for the kernel addresses. All kernel addresses are
>>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>>> to avoid running into a situation where the x86 page level is greater
>>> than the RMP page level.
>> In other words, if the host maps guest memory with 2M mappings, the
>> guest can induce page faults in the host. The only way the host can
>> avoid this is to map everything with 4k mappings.
>>
>> If the host does not avoid this, it could end up in the situation where
>> it gets page faults on access to kernel data structures. Imagine if a
>> kernel stack page ended up in the same 2M mapping as a guest page. I
>> *think* the next write to the kernel stack would end up double-faulting.
> I’m confused by this scenario. This should only affect physical pages that are in the 2M area that contains guest memory. But, if we have a 2M direct map PMD entry that contains kernel data and guest private memory, we’re already in a situation in which the kernel touching that memory would machine check, right?

When SEV-SNP is enabled in the host, a page can be in one of the
following states:

1. Hypervisor (assigned = 0, Validated=0)

2. Firmware (assigned = 1, immutable=1)

3. Context/VMSA (assigned=1, vmsa=1)

4. Guest private (assigned = 1, Validated=1)
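
As a rough illustration of the four states above, here is a minimal
sketch. The struct, enum and function names are made up for this mail,
and the struct is not the real RMP entry layout; it only carries the
bits listed above. Any bit combination not listed here is reported as
"other".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified view of the RMP bits mentioned in the list.
 * This is NOT the hardware RMP entry format.
 */
struct rmp_bits {
	bool assigned;
	bool validated;
	bool immutable;
	bool vmsa;
};

enum snp_page_state {
	SNP_PAGE_HYPERVISOR,	/* assigned = 0, validated = 0 */
	SNP_PAGE_FIRMWARE,	/* assigned = 1, immutable = 1 */
	SNP_PAGE_VMSA,		/* assigned = 1, vmsa = 1 */
	SNP_PAGE_GUEST_PRIVATE,	/* assigned = 1, validated = 1 */
	SNP_PAGE_OTHER,		/* any combination not listed above */
};

/* Map the bit combination to one of the four states from the list. */
static enum snp_page_state snp_classify(const struct rmp_bits *e)
{
	assert(e != NULL);

	if (!e->assigned)
		return e->validated ? SNP_PAGE_OTHER : SNP_PAGE_HYPERVISOR;
	if (e->immutable)
		return SNP_PAGE_FIRMWARE;
	if (e->vmsa)
		return SNP_PAGE_VMSA;
	if (e->validated)
		return SNP_PAGE_GUEST_PRIVATE;

	return SNP_PAGE_OTHER;
}
```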


You are right that we should never run into a situation where kernel
data and a guest page are in the same PMD entry.

During the SEV-VM creation, KVM allocates one firmware page and one VMSA
page for each vCPU. The firmware page is used by the SEV-SNP firmware
to keep some private metadata. The VMSA page contains the guest register
state. I am more concerned about the pages allocated by KVM for the
VMSA and firmware. These pages are not guest private pages per se. To
avoid getting into this situation we can probably create an SNP buffer
pool. All the firmware and VMSA pages should come from this pool.

Another challenging case: KVM maps a guest page and writes to it. One
such example is the GHCB page. If the mapped address is covered by a 2M
PMD entry then we will get an RMP violation.
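
To make the level mismatch concrete, here is a small sketch of the
check I have in mind. It is a simplified model of the rule discussed in
this thread (the host mapping level must not be greater than the RMP
page level), not the actual hardware RMP check, and all names are
hypothetical. It also shows which of the 512 4K entries a given address
lands in after a psmash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_4K	0x1000ULL
#define SKETCH_SZ_2M	0x200000ULL

/* Hypothetical page-level values; a larger value means a larger page. */
enum sketch_level {
	SKETCH_LEVEL_4K = 1,
	SKETCH_LEVEL_2M = 2,
};

/*
 * Simplified rule from this thread: the host mapping is a problem when
 * its level is greater than the RMP page level, e.g. a 2M physmap entry
 * over a region whose RMP entry has been psmash'ed into 4K entries.
 */
static bool mapping_exceeds_rmp(enum sketch_level map_level,
				enum sketch_level rmp_level)
{
	return map_level > rmp_level;
}

/* Base of the 2M-aligned region that contains pa. */
static uint64_t region_base_2m(uint64_t pa)
{
	return pa & ~(SKETCH_SZ_2M - 1);
}

/* Index (0..511) of the 4K page containing pa within its 2M region. */
static unsigned int subpage_index(uint64_t pa)
{
	unsigned int idx = (unsigned int)((pa & (SKETCH_SZ_2M - 1)) / SKETCH_SZ_4K);

	assert(idx < SKETCH_SZ_2M / SKETCH_SZ_4K);	/* 512 entries per 2M */
	return idx;
}
```

With gpa 0x1000 from the earlier example, subpage_index() gives entry 1
of the 2M region starting at 0, and a 2M mapping over a 4K RMP entry is
the case that would need the physmap split.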


> ISTM we should fully unmap any guest private page from the kernel and all host user pagetables before actually making it be a guest private page.