Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps
From: Pawan Gupta
Date: Mon May 22 2023 - 16:55:09 EST
On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> On Fri, May 19, 2023, Pawan Gupta wrote:
> > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > >
> > > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > > entry+exit.
> >
> > Unnecessary VERWs in the guest will have a much higher impact than the MSR
> > read/write at vmentry/exit.
>
> Can you provide numbers for something closeish to a real world workload?
I am collecting the numbers and will post them here soon.
> > On an Icelake system it is pointless for a guest to incur VERW penalty when
> > the system is not affected by MDS/TAA and guests don't need mitigation for
> > MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> > unnecessary VERWs (e.g. when the guest thinks it's running on an older gen
> > CPU).
> >
> > > KVM just needs to context switch the MSR between guests since the value that's
> > > loaded while running in the host is irrelevant. E.g. use a percpu cache to
> >
> > I will be happy to avoid the MSR read/write, but it's worth considering
> > that this MSR may gain more bits that the host wants to toggle, in which
> > case the percpu cache implementation would likely have to change.
>
> Change in and of itself isn't problematic, so long as whatever code we write won't
> fall over if/when new bits are added, i.e. doesn't clobber unknown bits.
Ok.
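To make the "doesn't clobber unknown bits" requirement concrete, here is a
minimal userspace sketch, not KVM code. The helper name is made up for
illustration, and the FB_CLEAR_DIS bit position (bit 3) is an assumption that
should be checked against msr-index.h:

```c
#include <stdint.h>

/* Assumed bit position of FB_CLEAR_DIS in MSR_IA32_MCU_OPT_CTRL. */
#define FB_CLEAR_DIS	(UINT64_C(1) << 3)

/*
 * Hypothetical helper: compute the value to load for a guest. Only
 * FB_CLEAR_DIS is changed; every other bit of 'cur', known or unknown,
 * is passed through untouched.
 */
uint64_t mcu_opt_ctrl_for_guest(uint64_t cur, int disable_fb_clear)
{
	if (disable_fb_clear)
		return cur | FB_CLEAR_DIS;

	return cur & ~FB_CLEAR_DIS;
}
```

Since the helper only ever ORs in or masks out the one bit, bits added to the
MSR later would pass through without any change to this code.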
> > > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> > > i.e. the host's desired value is effectively static post-boot, and barring
> > > a buggy configuration (running KVM as a guest), the boot CPU's value will be
> > > the same as every other CPU.
> >
> > Would the MSR value be the same on every CPU if only some guests have
> > enumerated FB_CLEAR and others haven't?
>
> Ignore the guest, I'm talking purely about the host. Specifically, there's no
> reason to do a RDMSR to get the host value on every VM-Enter since the host's
> value is effectively static post-boot.
That's right (ignoring a late microcode load adding bits to the MSR, or
msr-tools fiddling).
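As a rough illustration of the percpu-cache idea, here is a userspace sketch
that snapshots the value once and only "writes" the MSR when the wanted value
differs from what is already loaded. This is only my reading of the
suggestion, not a proposed patch: struct cpu_sim and both functions are
hypothetical, the MSR is simulated by a plain field, and the FB_CLEAR_DIS bit
position is an assumption:

```c
#include <stdint.h>

#define FB_CLEAR_DIS	(UINT64_C(1) << 3)	/* assumed bit position */

/* Stand-in for one CPU's state; in the kernel this would be percpu data. */
struct cpu_sim {
	uint64_t msr;		/* simulated MSR_IA32_MCU_OPT_CTRL */
	uint64_t cached;	/* last value written, mirrors 'msr' */
	unsigned int nr_wrmsr;	/* number of simulated WRMSRs */
};

/* Snapshot the host value once, e.g. when the CPU is brought up. */
void cpu_sim_init(struct cpu_sim *cpu, uint64_t host_val)
{
	cpu->msr = host_val;
	cpu->cached = host_val;
	cpu->nr_wrmsr = 0;
}

/* Load what a vCPU wants; skip the write if it is already loaded. */
void cpu_sim_load_guest(struct cpu_sim *cpu, int disable_fb_clear)
{
	uint64_t want = disable_fb_clear ? (cpu->cached | FB_CLEAR_DIS)
					 : (cpu->cached & ~FB_CLEAR_DIS);

	if (want == cpu->cached)
		return;		/* no read, no write */

	cpu->msr = want;	/* would be a WRMSR in the kernel */
	cpu->cached = want;
	cpu->nr_wrmsr++;
}
```

With this shape there is no read on the entry path at all, and back-to-back
runs of vCPUs that want the same setting cost no writes. A late microcode
load or msr-tools change would make the cache stale, so a real version would
need to handle that.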
> > MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> > FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
> >
> > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > will and will not do, and should instead honor whatever behavior is presented
> > > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > > then KVM damn well needs to let VERW flush buffers.
> >
> > The current implementation allows guests to have VERW flush buffers when
> > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > guest is trying to mitigate a vulnerability (like MDS) on hardware that
> > is not affected. I guess it's common for guests to run with an older gen
> > configuration on newer hardware.
>
> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.
Before MMIO Stale Data, VERW on processors not affected by MDS/TAA did not
clear CPU buffers, even if those processors enumerated MD_CLEAR. Guests that
deployed VERW on such processors (thinking they were vulnerable to MDS)
therefore did not clear the CPU buffers. After MMIO Stale Data was
discovered, FB_CLEAR_DIS was introduced to restore this behavior.
> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurate with respect to when the guest can expect
> VERW to flush buffers. But the comment is so, so wrong.
Agreed, the comment needs to explain this well.
> 	/*
> 	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
> 	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
> 	 * execute VERW.
> 	 */
> 	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
> 	   ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> 		vmx->disable_fb_clear = false;
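For reference while rewording the comment, the quoted condition can be
restated as a standalone predicate. This is a userspace sketch only: the
function name and the ARCH_CAP_ALL_NO mask are hypothetical, and the bit
positions are what I believe msr-index.h defines, so treat them as
assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed IA32_ARCH_CAPABILITIES bit positions; verify against msr-index.h. */
#define ARCH_CAP_MDS_NO		(UINT64_C(1) << 5)
#define ARCH_CAP_TAA_NO		(UINT64_C(1) << 8)
#define ARCH_CAP_SBDR_SSDP_NO	(UINT64_C(1) << 13)
#define ARCH_CAP_FBSDP_NO	(UINT64_C(1) << 14)
#define ARCH_CAP_PSDP_NO	(UINT64_C(1) << 15)
#define ARCH_CAP_FB_CLEAR	(UINT64_C(1) << 17)

/* Hypothetical mask: the guest sees itself as affected by none of these. */
#define ARCH_CAP_ALL_NO		(ARCH_CAP_MDS_NO | ARCH_CAP_TAA_NO | \
				 ARCH_CAP_PSDP_NO | ARCH_CAP_FBSDP_NO | \
				 ARCH_CAP_SBDR_SSDP_NO)

/*
 * True when the quoted condition would set disable_fb_clear to false:
 * either the guest enumerates FB_CLEAR, or it enumerates every *_NO bit
 * and so has no reason to execute VERW as a mitigation.
 */
bool guest_skips_fb_clear_dis(uint64_t guest_caps)
{
	if (guest_caps & ARCH_CAP_FB_CLEAR)
		return true;

	return (guest_caps & ARCH_CAP_ALL_NO) == ARCH_CAP_ALL_NO;
}
```

For example, a guest that enumerates only MDS_NO and TAA_NO does not satisfy
the predicate, so FB_CLEAR_DIS stays in effect for it.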