Re: [PATCH] KVM: VMX: Flush shadow VMCS on emergency reboot

From: Chao Gao
Date: Mon Apr 14 2025 - 09:19:09 EST


On Fri, Apr 11, 2025 at 09:57:55AM -0700, Sean Christopherson wrote:
>On Fri, Apr 11, 2025, Chao Gao wrote:
>> On Thu, Apr 10, 2025 at 02:55:29PM -0700, Sean Christopherson wrote:
>> >On Mon, Mar 24, 2025, Chao Gao wrote:
>> >> Ensure the shadow VMCS cache is evicted during an emergency reboot to
>> >> prevent potential memory corruption if the cache is evicted after reboot.
>> >
>> >I don't suppose Intel would want to go on record and state what CPUs would actually
>> >be affected by this bug. My understanding is that Intel has never shipped a CPU
>> >that caches shadow VMCS state.
>>
>> I am not sure. Would you like me to check internally?
>
>Eh, if it's easy, it'd be nice to have, but don't put much effort into it. I'm
>probably being too cute in hesitating about sending this to stable@. The risk
>really shouldn't be high.

I've raised this question internally and will get back once I have an answer.

>
>> However, SDM Chapter 26.11 includes a footnote stating:
>> "
>> As noted in Section 26.1, execution of the VMPTRLD instruction makes a VMCS
>> active. In addition, VM entry makes active any shadow VMCS referenced by the
>> VMCS link pointer in the current VMCS. If a shadow VMCS is made active by VM
>> entry, it is necessary to execute VMCLEAR for that VMCS before allowing that
>> VMCS to become active on another logical processor.
>> "
>>
>> To me, this suggests that shadow VMCS may be cached, and software shouldn't
>> assume the CPU won't cache it. But, I don't know if this is the reality or
>> if the statement is merely for hardware implementation flexibility.
>>
>> >
>> >On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
>>
>> Actually this behavior is not publicly documented.
>
>Well shoot. That should probably be remedied. Even if the behavior is guaranteed
>only on CPUs that support SEAM, _that_ detail should be documented. I'm not
>holding my breath on Intel allowing third party code in SEAM, but the mode _is_
>documented in the SDM, and so IMO, the SDM should also document how things like
>clearing the VMCS cache are supposed to work when there are VMCSes that "untrusted"
>software may not be able to access.

I'm also inquiring internally whether VMXOFF flushes all VMCSs or only SEAM
VMCSs, and whether this behavior can be documented publicly.

A related question is why KVM flushes VMCSs in the first place. I haven't
found any explicit statement in the SDM indicating that the flush is necessary.

SDM chapter 26.11 mentions:

If a logical processor leaves VMX operation, any VMCSs active on that logical
processor may be corrupted (see below). To prevent such corruption of a VMCS
that may be used either after a return to VMX operation or on another logical
processor, software should execute VMCLEAR for that VMCS before executing the
VMXOFF instruction or removing power from the processor (e.g., as part of a
transition to the S3 and S4 power states).

To me, the issue described there is VMCS corruption after leaving VMX
operation, and the VMCLEAR is necessary only if software intends to use the
VMCS after re-entering VMX operation or on another logical processor.

From previous KVM commits, I find two different reasons for flushing VMCSs:

- Ensure VMCSs in vmcore aren't corrupted [1]
- Prevent cached VMCS data, when evicted and written back, from overwriting
  random memory in the new kernel [2]

The first reason makes sense and aligns with the SDM. However, the second lacks
explicit support from the SDM, suggesting either a gap in the SDM or a
misreading on our part. So, I will take this opportunity to seek clarification.

[1]: https://lore.kernel.org/kvm/50C0BB90.1080804@xxxxxxxxx/
[2]: https://lore.kernel.org/kvm/20200321193751.24985-2-sean.j.christopherson@xxxxxxxxx/

>
>> >that's going to be the architectural behavior going forward, will that behavior
>> >be enumerated to software? Regardless of whether there's software enumeration,
>> >I would like to have the emergency disable path depend on that behavior. In part
>> >to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
>> >this bug.
>>
>> I don't understand how we can gain confidence that SEAM VMCSes won't screw
>> over kdump.
>
>If KVM relies on VMXOFF to purge the VMCS cache, then it gives a measure of
>confidence that running TDX VMs won't leave behind SEAM VMCSes in the cache. KVM
>can't easily clear SEAM VMCSs, but IIRC, the memory can be "forcefully" reclaimed
>by paving over it with MOVDIR64B, at which point having VMCS cache entries for
>the memory would be problematic.
>
>> If a VMM wants to leverage the VMXOFF behavior, software enumeration
>> might be needed for nested virtualization. Using CPU F/M/S (SPR+) to
>> enumerate a behavior could be problematic for virtualization. Right?
>
>Yeah, F/M/S is a bad idea. Architecturally, I think the behavior needs to be
>tied to support for SEAM. Is there a safe-ish way to probe for SEAM support,
>without having to glean it from MSR_IA32_MKTME_KEYID_PARTITIONING?
>
>> >If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
>> >caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
>> >tagged for stable.
>>
>> Agreed.
>>
>> Sean, I am not clear whether you intend to fix this issue and, if so, how.
>> Could you clarify?
>
>Oh, I definitely plan on taking this patch, I'm just undecided on whether or not
>to tag it for stable@.

Thanks.