Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

From: Pawan Gupta

Date: Fri Mar 27 2026 - 20:43:24 EST

On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > On Thu, 26 Mar 2026 11:01:20 +0100
> > Borislav Petkov <bp@xxxxxxxxx> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > ALTERNATIVE. Please let me know if I am missing something.
> > >
> > > Yes, you are.
> > >
> > > The point is that you don't want to stick those alternative calls inside some
> > > magic bhb_loop function but hand them in from the outside, as function
> > > arguments.
> > >
> > > Basically what I did.
> > >
> > > Then you were worried about this being C code and it had to be noinstr... So
> > > that outer function can be rewritten in asm, I think, and still keep it well
> > > separate.
> > >
> > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > >
> >
> > I think someone tried getting C code to write the values to global data
> > and getting the asm to read them.
> > That got discounted because it split things between two largely unrelated files.
>
>
> The implementation with global variables wasn't that bad, let me revive it.
>
> This part ties the sequence to the BHI mitigation, which is not ideal
> (because VMSCAPE also uses it), but it does seem a cleaner option.
>
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
>
>  static void __init bhi_update_mitigation(void)
>  {
> +       if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> +               bhi_seq_outer_loop = 5;
> +               bhi_seq_inner_loop = 5;
> +       }
> +
>
> I believe this can be moved to somewhere common to all mitigations.
>
> > I think the BPF code would need significant refactoring to call a C function.
>
> Ya, true. Will use globals and keep clear_bhb_loop() in asm.

While testing this approach, I noticed that syscalls were suffering an 8%
regression on ICX for Native BHI mitigation:

$ perf bench syscall basic -l 100000000
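
For a quick cross-check outside of perf, here is a minimal sketch of a
userspace timing loop. As far as I know, "perf bench syscall basic" times
repeated getppid() calls, so the sketch does the same. The helper name
ns_per_getppid() is hypothetical and not part of any patch in this series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average wall-clock nanoseconds per getppid() call.
 * Each iteration makes one cheap syscall round trip, so the result is
 * dominated by kernel entry/exit cost, including any entry mitigation.
 */
static double ns_per_getppid(long iters)
{
	struct timespec start, end;
	long i;

	assert(iters > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		(void)getppid();
	clock_gettime(CLOCK_MONOTONIC, &end);

	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / (double)iters;
}
```

Calling ns_per_getppid() with a large iteration count on kernels built with
and without the change should show the same relative difference, although
the absolute numbers will not match perf's.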

Bisection pointed to the change for using 8-bit registers (al/ah replacing
eax/ecx) as the main contributor to the regression. (The global variables
added a little, but that was within noise.)

Further digging revealed some strange behavior: using %ah for the inner loop
was causing the regression. Interchanging %al and %ah in the loops (for both
movb and sub) eliminated it.

<clear_bhb_loop_nofence>:

        movb    bhb_seq_outer_loop(%rip), %al

        call    1f
        jmp     5f
1:      call    2f
.Lret1: RET
2:      movb    bhb_seq_inner_loop(%rip), %ah
3:      jmp     4f
        nop
4:      sub     $1, %ah         <---- No regression with %al here
        jnz     3b
        sub     $1, %al
        jnz     1b

My guess is that "sub $1, %al" is faster than "sub $1, %ah". The inner loop
executes many more times than the outer one, so using %al there is likely
what makes the difference. A perf profile is needed to confirm this.
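
To put numbers on "many more times", here is a small C model of the loop
structure in the listing above. It is only a sketch: the function and struct
names are hypothetical, and 5/5 are the counts from the quoted bugs.c hunk
for CPUs without X86_FEATURE_BHI_CTRL. The counters are 8-bit, like %al and
%ah, and are decremented before being tested, as the sub/jnz pairs do.

```c
#include <stdint.h>

/* How many times each "sub $1, reg" would execute in one call. */
struct loop_counts {
	unsigned int inner_subs;	/* the sub at label 4 */
	unsigned int outer_subs;	/* the sub after the inner jnz */
};

/* Hypothetical model of the control flow only, not of the branches' effect. */
static struct loop_counts model_clear_bhb_loop(uint8_t outer, uint8_t inner)
{
	struct loop_counts c = { 0, 0 };
	uint8_t al = outer;		/* movb outer count, %al */

	do {
		uint8_t ah = inner;	/* label 2: reload the inner count */

		do {
			ah--;		/* label 4: sub $1 on the inner counter */
			c.inner_subs++;
		} while (ah != 0);	/* jnz 3b */

		al--;			/* sub $1 on the outer counter */
		c.outer_subs++;
	} while (al != 0);		/* jnz 1b */

	return c;
}
```

With outer = 5 and inner = 5 the model gives 25 inner decrements against 5
outer ones, so whichever register the inner loop uses dominates. Note that a
count of 0 would wrap around and run 256 iterations, since the decrement
happens before the test.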

I never imagined that register selection could make an 8% difference in
performance! Anyway, I will update the patch with this finding.