Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: David Laight
Date: Sat Mar 28 2026 - 11:07:02 EST
On Fri, 27 Mar 2026 17:42:56 -0700
Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
> On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > Borislav Petkov <bp@xxxxxxxxx> wrote:
> > >
> > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > >
> > > > Yes, you are.
> > > >
> > > > The point is that you don't want to stick those alternative calls inside some
> > > > magic bhb_loop function but hand them in from the outside, as function
> > > > arguments.
> > > >
> > > > Basically what I did.
> > > >
> > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > separate.
> > > >
> > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > >
> > >
> > > I think someone tried getting C code to write the values to global data
> > > and getting the asm to read them.
> > > That got discounted because it split things between two largely unrelated files.
> >
> >
> > The implementation with global variables wasn't that bad, let me revive it.
> >
> > This part ties the sequence to the BHI mitigation, which is not ideal
> > (because VMSCAPE also uses it), but it does seem a cleaner option.
> >
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> >
> > static void __init bhi_update_mitigation(void)
> > {
> > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > + bhi_seq_outer_loop = 5;
> > + bhi_seq_inner_loop = 5;
> > + }
> > +
> >
> > I believe this can be moved to somewhere common to all mitigations.
> >
> > > I think the BPF code would need significant refactoring to call a C function.
> >
> > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
>
> While testing this approach, I noticed that syscalls were suffering an 8%
> regression on ICX for Native BHI mitigation:
>
> $ perf bench syscall basic -l 100000000
>
> Bisection pointed to the change for using 8-bit registers (al/ah replacing
> eax/ecx) as the main contributor to the regression. (Global variables added
> a bit, but within noise).
>
> Further digging revealed a strange behavior: using %ah for the inner loop
> was causing the regression; interchanging %al and %ah in the loops
> (for the movb and sub) eliminated the regression.
>
> <clear_bhb_loop_nofence>:
>
> movb bhb_seq_outer_loop(%rip), %al
>
> call 1f
> jmp 5f
> 1: call 2f
> .Lret1: RET
> 2: movb bhb_seq_inner_loop(%rip), %ah
> 3: jmp 4f
> nop
> 4: sub $1, %ah <---- No regression with %al here
> jnz 3b
> sub $1, %al
> jnz 1b
>
> My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> inner loop, which is executed more times, is likely making the
> difference. A perf profile is needed to confirm this.
I bet it is also CPU dependent - it is quite likely that there isn't
any special hardware to support partial writes of %ah, so it ends up taking
a slow path (possibly even a microcoded one, to get an 8% regression).
As well as swapping %al <-> %ah, try changing the outer loop decrement to
sub $0x100, %ax
since %al is zero at that point, that will set the Z flag the same.
I've just hacked a test into some test code I've got.
I'm not seeing any unexpected costs on either zen-5 or haswell.
So it may be more subtle.
David
>
> Never imagined register selection could make an 8% difference in
> performance! Anyway, will update the patch with this finding.