Re: [PATCH v2] x86/mm: Don't disable INVLPG if "incomplete Global INVLPG flushes" is fixed by microcode

From: Sean Christopherson
Date: Thu Apr 04 2024 - 13:28:24 EST


On Thu, Apr 04, 2024, Andrew Cooper wrote:
> On 04/04/2024 5:18 pm, Sean Christopherson wrote:
> > On Mon, Mar 25, 2024, Michael Kelley wrote:
> >>> static void setup_pcid(void)
> >>> {
> >>> + const struct x86_cpu_id *invlpg_miss_match;
> >>> +
> >>> if (!IS_ENABLED(CONFIG_X86_64))
> >>> return;
> >>>
> >>> if (!boot_cpu_has(X86_FEATURE_PCID))
> >>> return;
> >>>
> >>> - if (x86_match_cpu(invlpg_miss_ids)) {
> >>> + invlpg_miss_match = x86_match_cpu(invlpg_miss_ids);
> >>> + if (invlpg_miss_match &&
> >>> + invlpg_miss_match->driver_data > boot_cpu_data.microcode) {
> >>> pr_info("Incomplete global flushes, disabling PCID");
> >>> setup_clear_cpu_cap(X86_FEATURE_PCID);
> >>> return;
> >> As noted in similar places where microcode versions are
> >> checked, hypervisors often lie to guests about microcode versions.
> >> For example, see comments in bad_spectre_microcode(). I
> >> know Hyper-V guests always see the microcode version as
> >> 0xFFFFFFFF (max u32 value). So in a Hyper-V guest the above
> >> code will always leave PCID enabled.
> > Enumerating broken PCID support to a guest is very arguably a hypervisor bug.
> > Hypervisors also lie to guest about FMS. As KVM *user* with affected hardware
> > (home box), I would want the kernel to assume PCID works if X86_FEATURE_HYPERVISOR
> > is present.
>
> I have very mixed feelings about all of this.
>
> The Gracemont INVLPG vs PCID bug was found in the field, so PCID will
> have been enumerated to guests at that point.  You can't blindly drop
> PCID until the VM next reboots.
>
> A related example.  I wrote the patch to hide XSAVES to work around an
> AMD erratum where XSAVEC sufficed, and the consequences were so dire for
> some versions of Windows that there was a suggestion to simply revert
> the workaround to make VMs run again.  Windows intentionally asserts
> sanity (== expectations) in what it can see; I have no idea whether it
> would object in this case but hiding PCID is definitely playing with fire.

Yeah, KVM users got burned by that too. d52734d00b8e ("KVM: x86: Give a hint when
Win2016 might fail to boot due to XSAVES erratum").

But practically speaking, that ship has sailed for KVM, as KVM advertises PCID
support if and only if the host kernel supports it, i.e. clearing X86_FEATURE_PCID
in setup_pcid() means KVM stops advertising to userspace, which in turn means
QEMU stops enumerating it VMs by default.

> I am frequently dismayed at how many FMS abuses there are in Linux, and
> I'm this >< close to talking a leaf out of HyperV's book and poisoning
> FMS to 0 or ~0 just like the microcode revision.   Any use of FMS for
> anything other than diagnostic purposes under virt is live migration bug
> waiting to happen.
>
> Xen's current TLB algorithms ensure never to have both PCID and PGE
> active together, so we managed to dodge this particular mess.  But as a
> consequence, we've got no logic to spot it, or to consider changing PCID
> visibility.  That said, for better or worse, the ucode revision is
> visible (for now), and a guest which polls the revision will even spot
> the hypervisor doing a late ucode load.
>
> Sorry I don't have any better suggestions.  The only nice(ish) of fixing
> this for the guest kernel is for Intel to allocate a $FOO_NO bit, and
> it's horrible in every other way.

Hmm, one crazy idea would be to carve out a hypervisor CPUID range for enumerating
(potentially) broken features. Dealing with the Intel/AMD (and Centaur, LOL),
0 / 0x8000_0000 split would be annoying, but not hard. E.g. use 0x4{0,8,C}01_xxxx
to enumerate broken features, and then the guest could do:

support = CPUID(leaf).reg & ~CPUID(to_pv_broken(leaf)).reg;

It'd require a decent amount of churn for the initial support, but it would give
hypervisors a way to inform guests that _any_ CPUID-based feature is broken,
without requiring guest changes (after the initial code is merged) or explicit
action from hardware vendors.

And if we got Windows/Hyper-V in on the game, it would allow them to keep their
sanity checks while (hopefully) degrading gracefully if a feature is enumerated
as broken.