Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state

From: Yan Zhao
Date: Thu May 25 2023 - 06:35:06 EST


On Wed, May 24, 2023 at 11:21:09AM -0700, Sean Christopherson wrote:

...

> > > Picking a single vCPU will always be subject to edge cases. E.g. I can see something
> > > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > > ignores MTRR updates from other vCPUs in the new kernel.
> > >
> > Not familiar with kexec().
> > I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> > apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> > the lowest online vCPU id can be written to per-VM MTRR state.
> > Will that work for kexec()?
>
> kexec() allows "booting" into a new kernel without transitioning through BIOS
> and without doing a full reboot. The scenario I'm worried about is if the new
> kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs. I
> think it's an extremely unlikely scenario, but my point is that selecting a single
> vCPU to control the MTRRs works if and only if that vCPU stays online for the
> entire lifetime of the VM, which KVM can't guarantee.
>
> > > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > > the MTRRs are actually different. But as above, I'm on the fence about diverging
> > > from what hardware actually does with respect to MTRRs.
>
> ...
>
> > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > and will zap when CR0.CD is cleared. And to avoid regressing the CR0.CD case,
> > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > zap non-WB MTRR ranges when CR0.CD is cleared. Since WB is weak, looking for
Not only non-WB ranges are required to be zapped.
Think about a scenario:
(1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
(2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
(3) then guest performs MTRR disable, no zap too.
(4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
(5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.

Does it make sense?

> > > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > > and just
> > > process MTRRs one by one.
> > >
> > > Did that make sense? Minus the code to identify non-WB MTRR ranges (and much
> > > needed comments), the below is what I'm thinking. If more intelligent zapping
> > > provides the desired performance improvements, then I think/hope we avoid trying
> > > to play games with MTRRs.
> > >
> > > ---
> > > arch/x86/kvm/mtrr.c | 19 +++++++++++++++++++
> > > arch/x86/kvm/vmx/vmx.c | 8 ++------
> > > arch/x86/kvm/x86.c | 6 ++----
> > > 3 files changed, 23 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index a67c28a56417..e700c230268b 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > > if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > return;
> > >
> > > + if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > > + return;
> > This will always make update_mtrr() return here for Linux and OVMF.
>
> Yes, that's the intent. If the CR0.CD quirk is _disabled_, then all SPTEs are
> UC, i.e. there's no need to zap. If the quirk is enabled (the common case,
> unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
> properly, and so zapping SPTEs is pointless. And in both cases, KVM will zap
> relevant SPTEs when CR0.CD is cleared.
My worry is that if the quirk is enabled,
(1) when CR0.CD=1, the EPT memtype is WB.
(2) when MTRR is further disabled,
a. with EPT zap, the new EPT memtype is UC, effective memtype is UC
b. without EPT zap, some EPT memtype is still WB, effective memtype is guest PAT type
However, in guest driver's point of view, the memtype is UC, because its MTRR is disabled.

So, if we don't zap EPT when guest disables MTRR, and if there's
non-coherent DMA on-going which requires guest driver to flush caches to
ensure cache coherency (though I don't think this scenario will happen in reality), guest
driver may not do the right thing as it thinks the memory is UC which
has no need to flush cache while the effective memtype is WB.

Previously, (i.e. in current upstream implementation),
(1) when CR0.CD=1, the EPT memtype is WB + ignore guest PAT.
(2) when MTRR is further disabled,
EPT is zapped, new EPT memtype is UC, effective memtype is UC.
The guest device driver can flush cache well for non-coherent DMA.

So, though the quirk will make current upstream code malfunction when
CR0.CD is set alone, it still functions well if MTRR is also disabled.

But I doubt if we need to care about this extreme condition of
non-coherent DMA that happens when CR0.CD=1 and MTRR is disabled.


>
> And this is actually already the behavior for all MTRRs execpt for MSR_MTRRdefType
> due to the !mtrr_is_enabled() clause below.
>
> > > if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> > > return;
> > >
> > > @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > > }
> > > }
> > >
> > > +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> > > +{
> > > + if (cr0 & X86_CR0_CD)
> > > + return;
> > > +
> > > + if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > + return;
> > > +
> > > + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> > > + kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > > + return;
> > > + }
> > > +
> > > + <zap non-WB memory>;
> > This zap looks will happen on each vCPU.
>
> Yes, on CR0.CD 1=>0 transition.
>
> > Then only half of zaps are saved compared to the original count of zaps in
> > update_mtrr().
>
> I don't think the current code is functionally correct though, even if we define
> "correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
> can't possibly be correct). Keeping stale WB+IGMT entries that were installed
> while CR0.CD=1 is broken. As mentioned in my previous mail, I suspect it works
> only because KVM doesn't install SPTEs for DMA'd ranges. FWIW, the WB memtype
> is unlikely to be a problem, but setting IGMT definitely is. Though of course
> we can and should fix the IGMT thing separately.
Current code may also not function well for non-coherent DMA when guest vCPU has
no X86_FEATURE_MTRR, as it returns WB (without IPAT) as EPT memtype.
Then the effective memtype is guest PAT, but from guest's point of view, it's all
UC. (see pat_x_mtrr_type() in Linux's code).
But I guess we don't need to mind this case?


>
> > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
>
> Yes, but they'll be much, much more precise. And the bajillion fixed MTRRs can
Could we instead keep a per-VM copy of MTRR?
Then, if a vCPU modifies an MTRR entry and we find it's different from that
in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
those dirty ranges. (or just zap all if there are dirty entries)
In theory, only one vCPU will trigger this zap in each round of MTRR update.

> be batched into a single zap by sacrificing a tiny amount of precision, i.e. zap
> from the lowest non-WB fixed MTRR to the highest. Crucially, assuming BIOS and
> the kernel aren't doing something bizarre, that should preserve the SPTEs for the
> code the guest is executing from (0 - VGA hole should be WB).
>
> And if we want to squeeze more performance out of this path, there are other
> optimizations we can make. E.g. I'm guessing one of the problems, perhaps even
> _the_ problem, is that there's contention on mmu_lock in kvm_zap_gfn_range() when
> bringing up APs, which is likely why you observe slow downs even when there are
> no SPTEs to zap. A thought for handling that would be to do something akin to
> kvm_recalculate_apic_map(). E.g. instead of having every vCPU zap, track (a)
> if a zap is in-progress and (b) the ranges being zapped. There will still be
> lock contention to add ranges, but it should be fairly short in duration compared
> to zapping all possible SPTEs (current behavior).
Yes, I guess this one is more performant, too :)

>
> Something like
>
> struct tbd *range = NULL;
> bool do_zap = false;
>
> <gather non-wb ranges>
>
> for_each_non_wb_range(...) {
> if (!range)
> range = kmalloc(...);
>
> spin_lock(&kvm->arch.mtrr_zap_lock);
> if (<range not in list>) {
> list_add_tail(&range->list, &kvm->arch.mtrr_zap_list);
> range = NULL;
>
> if (!kvm->arch.mtrr_zapping) {
> do_zap = true;
> kvm->arch.mtrr_zapping = true;
> }
> }
> spin_unlock(&kvm->arch.mtrr_zap_lock);
> }
>
> kfree(zap_entry);
>
> if (do_zap) {
> spin_lock(&kvm->arch.mtrr_zap_lock);
>
> while (!list_empty(&kvm->arch.mtrr_zap_list)) {
> struct tbd *range;
>
> range = list_first_entry(&kvm->arch.mtrr_zap_list);
> list_del(range->list);
>
> spin_unlock(&kvm->arch.mtrr_zap_lock);
>
> kvm_zap_gfn_range(..., range->start, range->end);
> kfree(range);
>
> spin_lock(&kvm->arch.mtrr_zap_lock);
> }
>
> /* Clear under lock to ensure new entries are processed. */
> kvm->arch.mtrr_zapping = false;
>
> spin_unlock(&kvm->arch.mtrr_zap_lock);
> }
>
> /* Wait for the zap to complete. */
> while (READ_ONCE(kvm->arch.mtrr_zapping))
> cpu_relax();
>
> and I'm sure someone that's better at lockless programming could optimize that
> even further if necessary, e.g. by checking for "range in list" outside of the
> spinlock.
>
> E.g. in my OVMF based VM, the precise zap would target only two ranges, the VGA
> hole and the 32-bit PCI:
>
> kvm: vCPU0 default type = 6
> kvm: vCPU0 fixed MTRR range 0 - 15 == 6
> kvm: vCPU0 fixed MTRR range 16 - 87 == 0
> kvm: vCPU0 variable MTRR range 80000000 - 100000000 = 0
>
> That bumps up to three ranges for the kernel, which adds what I suspect is the
> 64-bit PCI hole:
>
> kvm: vCPU0 default type = 6
> kvm: vCPU0 fixed MTRR range 0 - 15 == 6
> kvm: vCPU0 fixed MTRR range 16 - 87 == 0
> kvm: vCPU0 variable MTRR range 800000000 - 1000000000 = 0
> kvm: vCPU0 variable MTRR range 80000000 - 100000000 = 0
>
> The above is distilled information from this hack-a-print:
>
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 9fac1ec03463..6259c7a4bcd3 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -304,12 +304,42 @@ static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
> *end = (*start | ~mask) + 1;
> }
>
> +
> +static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> +{
> + return (range->mask & (1 << 11)) != 0;
> +}
> +
> static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> {
> struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> gfn_t start, end;
> int index;
>
> + if (mtrr_is_enabled(mtrr_state) && msr == MSR_MTRRdefType) {
> + struct kvm_mtrr_range *range;
> + int i;
> +
> + pr_warn("vCPU%u default type = %u\n", vcpu->vcpu_idx, (u8)(mtrr_state->deftype & 0xff));
> +
> + if (!fixed_mtrr_is_enabled(mtrr_state)) {
> + pr_warn("vCPU%u fixed MTRRs disabled\n", vcpu->vcpu_idx);
> + } else {
> + for (i = 0; i < ARRAY_SIZE(mtrr_state->fixed_ranges); i++)
> + pr_warn("vCPU%u fixed MTRR range %u == %u\n",
> + vcpu->vcpu_idx, i, mtrr_state->fixed_ranges[i]);
> + }
> +
> + list_for_each_entry(range, &mtrr_state->head, node) {
> + if (!var_mtrr_range_is_valid(range))
> + continue;
> +
> + var_mtrr_range(range, &start, &end);
> + pr_warn("vCPU%d variable MTRR range %llx - %llx = %u\n",
> + vcpu->vcpu_idx, start, end, (u8)(range->base & 0xff));
> + }
> + }
> +
> if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
> !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> return;
> @@ -333,11 +363,6 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> }
>
> -static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> -{
> - return (range->mask & (1 << 11)) != 0;
> -}
> -
> static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> {
> struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
>