Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state

From: Yan Zhao
Date: Wed May 24 2023 - 07:29:43 EST


On Tue, May 23, 2023 at 05:13:38PM -0700, Sean Christopherson wrote:
> On Tue, May 23, 2023, Yan Zhao wrote:
> > On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> > > On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > > > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > > >
> > > > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > > > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > > >
> > > > Though each vCPU has its own MTRR state, MTRR states should be
> > > > consistent across each VM, which is demanded as in Intel's SDM
> > > > "In a multiprocessor system using a processor in the P6 family or a more
> > > > recent family, each processor MUST use the identical MTRR memory map so
> > > > that software will have a consistent view of memory."
> > > >
> > > > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > > > a per-VM version of guest MTRR can be referenced.
> > > >
> > > > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > > > returning the right value when there's lag of MTRR update for each vCPU.
> > > >
> > > Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> > > only? therefore can simplify implementation and avoid hazard of
> > > inconsistency among per-VPU MTRR states.
> > >
> > > I see in SDM, it notes:
> > > "In multiple processor systems, the operating system must maintain MTRR
> > > consistency between all the processors in the system (that is, all
> > > processors must use the same MTRR values). The P6 and more recent processor
> > > families provide no hardware support for maintaining this consistency."
> > >
> > > leaving each vCPU's MTRR is just to fully mimic HW?
> > >
> > Yes, leaving each vCPU's MTRR to mimic HW.
> >
> > As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> >
> > for each online CPUs {
> > disable MTRR
> > update fixed/var MTRR ranges
> > enable MTRR
> > }
> > Guest OS needs to access memory only after this full pattern.
>
> FWIW, that Linux doesn't use that approach. Linux instead only puts the cache
> into no-fill mode (CR0.CD=1) when modifying MTRRs. OVMF does both (and apparently
> doesn't optimize for self-snoop?).
I think Linux also follows this patten.
This is the call trace I found out in my environment.
cache_cpu_init
cache_disable
write_cr0 to CD=1, NW=0
mtrr_disable
mtrr_generic_set_state
mtrr_wrmsr to fixed/var ranges
cache_enable
mtrr_enable
write_cr0(read_cr0() & ~X86_CR0_CD);

For PAT not enabled in guest,
arch_phys_wc_add
mtrr_add
mtrr_add_page
set_mtrr_cpuslocked
mtrr_rendezvous_handler on each online cpu
generic_set_mtrr
cache_disable
write_cr0 to CD=1, NW=0
mtrr_disable
mtrr_wrmsr
cache_enable


For OVMF and Seabios, I dumped in host to observe their MTRR behaviors:
1. OVMF updates MTRR in below sequence
(1) vCPU 0
CR0.CD=1
MTRR disable
MTRR update
MTRR enable
CR0.CD=0
(2) vCPU 1-n
CR0.CD=1
MTRR disable
(3) vCPU 1-n
MTRR update
(4) vCPU 1-n
MTRR enable
CR0.CD=0

2. Seabios can update MTRRs both when CR0.CD=0 and when CR0.CD=1 and follows below
sequence in each CPU
MTRR disable
MTRR update
MTRR enable
I found it can update MTRRs even when CR0.CD=0 to below values:
MSR_MTRRfix16K_80000(0x258): 0x606060606060606
MSR_MTRRfix4K_C0000(0x268): 0x505050505050505
...
MSR_MTRRfix4K_F8000(0x26f): 0x505050505050505
MTRRphysBase_MSR(0): 0xc0000000
MTRRphysMask_MSR(0): 0x3fffc0000800

>
> > So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> > states".
>
> Probably not, but despite the SDM's assertion that software "MUST" keep things
> consistent, that's not actually reality. E.g. a careful kernel that partitions
> memory doesn't need to strictly keep MTRRs consistent. Ditto for scenarios where
> CPUs are offline (for a variety of definitions of "offline"), in which case only
> online CPUs need to have consistent MTRRs, e.g. before APs are brought up, MTRRs
> are obviously inconsistent.
>
Yes. By "no inconsistency", I mean online vCPUs(who trigger active memory accesses)
need to all in a state that their MTRRs are consistent.
Offline vCPUs, as they will not access memory, it doesn't matter.
But once they are back to online, their MTRRs need to be in sync with online vCPUs.

> > I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> > zap, which costs quite a lot cpu cycles even when the EPT is empty.
> >
> > In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> > so that it can save some memory to keep the MTRR state.
> > But I found out that it would only work when vCPU 0 (boot processor) is
> > always online (which is not true for x86 under some configration).
> >
> > I'll try to find out lowest online vCPU
>
> Picking a single vCPU will always be subject to edge cases. E.g. I can see something
> like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> ignores MTRR updates from other vCPUs in the new kernel.
>
Not familiar with kexec().
I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
the lowest online vCPU id can be written to per-VM MTRR state.
Will that work for kexec()?

> One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> the MTRRs are actually different. But as above, I'm on the fence about diverging
> from what hardware actually does with respect to MTRRs.
>
> Argh, stupid quirks. I was going to suggest we take advantage of the guest needing
> to either disable MTRRs or put the cache into no-fill mode to avoid zapping SPTEs,
> i.e. skip the zap if CR0.CD=1, but KVM's implementation of the quirk is all kinds
> of messed up. KVM ignores MTRRs and guest PAT when CR0.CD=1, but then doesn't zap
> when CR0.CD is cleared:
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>
> which means that either KVM is relying on later MTRR changes to zap the bogus
> WB SPTEs, or more likely things "work" because KVM doesn't install SPTEs for the
> ranges that end up being DMA'd while CR0.CD is set.
>
> *sigh* What an absolutely asinine quirk. KVM should never have taken it on to
> workaround OVMF's stupidity. Good gravy, that thing was even Cc'd stable@. Yeesh.
>
> Aha! Idea. There's should be no need to ignore guest PAT when CR0.CD=1 and the
> quirk is enabled. Even OVMF must be smart enough to map its code as WB in its page
> tables. And there is no need to zap SPTEs when CR0.CD is _set_, because SPTEs
> created while CR0.CD=0 would have honored MTRRs. Lastly, DMA when CR0.CD=1 and
> the quirk is enabled simply can't work since KVM historically ignores everything
> from the guest and maps all memory WB.
>
> So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> and will zap when CR0.CD is cleared. And to avoid regressing the CR0.CD case,
> if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> zap non-WB MTRR ranges when CR0.CD is cleared. Since WB is weak, looking for
> non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> and just
> process MTRRs one by one.
>
> Did that make sense? Minus the code to identify non-WB MTRR ranges (and much
> needed comments), the below is what I'm thinking. If more intelligent zapping
> provides the desired performance improvements, then I think/hope we avoid trying
> to play games with MTRRs.
>
> ---
> arch/x86/kvm/mtrr.c | 19 +++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 8 ++------
> arch/x86/kvm/x86.c | 6 ++----
> 3 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index a67c28a56417..e700c230268b 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> return;
>
> + if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> + return;
This will always make update_mtrr() return here for Linux and OVMF.
> +
> if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> return;
>
> @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> }
> }
>
> +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> +{
> + if (cr0 & X86_CR0_CD)
> + return;
> +
> + if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> + return;
> +
> + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> + kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> + return;
> + }
> +
> + <zap non-WB memory>;
This zap looks will happen on each vCPU. Then only half of zaps are
saved compared to the original count of zaps in update_mtrr().
And pieces of no-WB memory might produce more kvm_zap_gfn_range()?


> +}
> +
> int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> {
> int index;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2d9d155691a7..962abc17afc0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7502,8 +7502,6 @@ static int vmx_vm_init(struct kvm *kvm)
>
> static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
> - u8 cache;
> -
> /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> * memory aliases with conflicting memory types and sometimes MCEs.
> * We have to be careful as to what are honored and when.
> @@ -7530,11 +7528,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>
> if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> - cache = MTRR_TYPE_WRBACK;
> + return MTRR_TYPE_WRBACK;
> else
> - cache = MTRR_TYPE_UNCACHABLE;
> -
> - return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> + return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> }
>
> return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 977dceb7ba7e..3c085b6f9e3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -941,10 +941,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
> if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> kvm_mmu_reset_context(vcpu);
>
> - if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> - kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> - !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> - kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> + if ((cr0 ^ old_cr0) & X86_CR0_CD))
> + kvm_mtrr_post_set_cr0(vcpu, cr0);
> }
> EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
>
>
> base-commit: 7e998f35c57cb0a2c35f8545b54ab5f4f5a83841
> --
>