Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

From: Jim Mattson

Date: Mon Apr 06 2026 - 10:34:10 EST


On Fri, Apr 3, 2026 at 8:50 PM Pawan Gupta
<pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 03, 2026 at 07:21:02PM -0700, Jim Mattson wrote:
> > On Fri, Apr 3, 2026 at 5:22 PM Pawan Gupta
> > <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 03, 2026 at 04:39:54PM -0700, Jim Mattson wrote:
> > > > > Since cloud providers have greater control over userspace, the decision to
> > > > > use BHI_DIS_S or not can be left to them. KVM would simply follow what it
> > > > > is asked to do by the userspace.
> > > >
> > > > I feel like we've gone over this before, but if userspace tells KVM
> > > > not to enable BHI_DIS_S, how do we inform Windows that it needs to do
> > > > the longer clearing sequence, despite the fact that the virtual CPU is
> > > > masquerading as Ice Lake?
> > >
> > > IMO, if an OS is allergic to a hardware mitigation, and is also aware that
> > > it is virtualized, it should default to a sw mitigation that works everywhere.
> >
> > Agreed. So, without any information to the contrary, VMs should assume
> > the long BHB clearing sequence is required.
> >
> > Returning to my earlier comment, the test should be:
> >
> > + if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL) ||
> > +     cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) {
> > +         bhb_seq_outer_loop = 12;
> > +         bhb_seq_inner_loop = 7;
> > + }
>
> To be clear, my comment was for an OS that doesn't want BHI_DIS_S
> under-the-hood with virtual-SPEC_CTRL. Linux doesn't have that problem;
> hardware mitigation on Linux is perfectly okay.

Today, BHI_DIS_S under-the-hood isn't offered. If the hypervisor
doesn't offer the paravirtual mitigation MSRs, the guest must assume
that the hypervisor will not set BHI_DIS_S on its behalf.

> Without virtual-SPEC_CTRL, the problem set is limited to guests that
> migrate across Alder Lake generation CPUs. As you mentioned, the change in
> MAXPHYADDR makes it unlikely.

I have been unable to make a compelling argument for not crossing this
boundary. The only applications I can point to that are broken by the
missing reserved bits are (nested) hypervisors using shadow-paging.
Since both nVMX and nSVM support TDP, the niche case isn't a concern.
There are compelling business reasons to support seamless migration
from pre-Alder Lake to post-Alder Lake. If you know of any other
applications that will fail with a mis-emulated smaller MAXPHYADDR,
please let me know.

> With virtual-SPEC_CTRL support, guests that fall into the subset that
> migrate in spite of the MAXPHYADDR change would also be mitigated. Then, on
> top of hardware mitigation, deploying the long sequence in the guest would
> incur a significant performance penalty for no good reason.

Yes, but the guest needs a way to determine whether the hypervisor
will do what's necessary to make the short sequence effective. And, in
particular, no KVM hypervisor today is prepared to do that.

When running under a hypervisor, without BHI_CTRL and without any
evidence to the contrary, the guest must assume that the longer
sequence is necessary. At the very least, we need a CPUID or MSR bit
that says, "the short BHB clearing sequence is adequate for this
vCPU."
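
For the sake of discussion, the policy above could be sketched roughly as
follows. This is only an illustration, not the patch's actual code: the
struct/function names, the `pv_short_seq_ok` enlightenment bit, and the
short-sequence iteration counts (5/5) are all hypothetical placeholders;
only the 12/7 long-sequence values come from the hunk quoted earlier.

```c
#include <stdbool.h>

/* Hypothetical view of the relevant feature bits as seen by the guest. */
struct bhb_features {
	bool bhi_ctrl;        /* X86_FEATURE_BHI_CTRL: BHI_DIS_S available */
	bool hypervisor;      /* X86_FEATURE_HYPERVISOR: running virtualized */
	bool pv_short_seq_ok; /* hypothetical PV bit: "short sequence is adequate" */
};

struct bhb_seq {
	int outer_loop;
	int inner_loop;
};

/*
 * Sketch of the policy: a guest with BHI_CTRL, or a virtualized guest
 * that has no explicit assurance from the hypervisor, must fall back
 * to the long BHB clearing sequence (12/7, per the hunk above).
 * The short-sequence counts here are placeholder values.
 */
static struct bhb_seq pick_bhb_seq(const struct bhb_features *f)
{
	struct bhb_seq seq = { .outer_loop = 5, .inner_loop = 5 }; /* short */

	if (f->bhi_ctrl || (f->hypervisor && !f->pv_short_seq_ok)) {
		seq.outer_loop = 12;
		seq.inner_loop = 7;
	}
	return seq;
}
```

The key point the sketch encodes is the default direction: absent the
proposed "short sequence is adequate" bit, a virtualized guest takes the
conservative (long) path.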