Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

From: Pawan Gupta

Date: Wed Apr 01 2026 - 04:13:09 EST


On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> On Fri, 27 Mar 2026 17:42:56 -0700
> Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
>
> > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > Borislav Petkov <bp@xxxxxxxxx> wrote:
> > > >
> > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > >
> > > > > Yes, you are.
> > > > >
> > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > arguments.
> > > > >
> > > > > Basically what I did.
> > > > >
> > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > separate.
> > > > >
> > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > >
> > > >
> > > > I think someone tried getting C code to write the values to global data
> > > > and getting the asm to read them.
> > > > That got discounted because it split things between two largely unrelated files.
> > >
> > >
> > > The implementation with global variables wasn't that bad, let me revive it.
> > >
> > > This part, which ties the sequence to the BHI mitigation, is not ideal
> > > (because VMSCAPE also uses it), but it does seem the cleaner option.
> > >
> > > --- a/arch/x86/kernel/cpu/bugs.c
> > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > >
> > > static void __init bhi_update_mitigation(void)
> > > {
> > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > + bhi_seq_outer_loop = 5;
> > > + bhi_seq_inner_loop = 5;
> > > + }
> > > +
> > >
> > > I believe this can be moved to somewhere common to all mitigations.
> > >
> > > > I think the BPF code would need significant refactoring to call a C function.
> > >
> > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> >
> > While testing this approach, I noticed that syscalls were suffering an 8%
> > regression on ICX for Native BHI mitigation:
> >
> > $ perf bench syscall basic -l 100000000
> >
> > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > eax/ecx) as the main contributor to the regression. (Global variables added
> > a bit, but within noise).
> >
> > Further digging revealed a strange behavior, using %ah for the inner loop
> > was causing the regression, interchanging %al and %ah in the loops
> > (for movb and sub) eliminated the regression.
> >
> > <clear_bhb_loop_nofence>:
> >
> > movb bhb_seq_outer_loop(%rip), %al
> >
> > call 1f
> > jmp 5f
> > 1: call 2f
> > .Lret1: RET
> > 2: movb bhb_seq_inner_loop(%rip), %ah
> > 3: jmp 4f
> > nop
> > 4: sub $1, %ah <---- No regression with %al here
> > jnz 3b
> > sub $1, %al
> > jnz 1b
> >
> > My guess is that "sub $1, %al" is faster than "sub $1, %ah". Using %al in
> > the inner loop, which executes far more often, is likely what makes the
> > difference. A perf profile is needed to confirm this.
>
> I bet it is also CPU dependant - it is quite likely that there isn't
> any special hardware to support partial writes of %ah so it ends up taking
> a slow path (possibly even a microcoded one to get an 8% regression).

Strangely, %ah in the inner loop incurs fewer uops and fewer branch
misses, yet takes more cycles. Below is the perf data for the sequence on a
Rocket Lake (similar observations on ICX and EMR):

Event                %al inner      %ah inner      Delta
-------------------  -------------  -------------  ------
cycles                 776,775,020    972,322,384  +25.2%
instructions/cycle            1.23           0.98  -20.3%
branch-misses            4,792,502        560,449  -88.3%
uops_issued.any        768,019,010    696,888,357   -9.3%
time elapsed               0.1627s        0.2048s  +25.9%

Time elapsed directly correlates with the increase in cycles.
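For what it's worth, the %al/%ah swap only moves which byte of the same
register holds which counter; the work done is identical. A portable C model
of the quoted loop (comment names taken from the asm above; counter values
as parameters) makes that explicit:

```c
#include <stdint.h>

/*
 * Portable model of the quoted clear_bhb_loop_nofence() nesting.  The
 * two byte-sized counters map onto %al/%ah; which byte holds which
 * counter changes how the CPU handles partial-register writes, but not
 * the number of times the inner body executes, so the variants are
 * behaviorally interchangeable.  Like the asm, this assumes both
 * counts are at least 1.
 */
static unsigned int clear_bhb_model(uint8_t outer_count, uint8_t inner_count)
{
	unsigned int body_runs = 0;
	uint8_t outer = outer_count;		/* movb bhb_seq_outer_loop, %al */

	do {
		uint8_t inner = inner_count;	/* movb bhb_seq_inner_loop, %ah */
		do {
			body_runs++;		/* the jmp/nop inner body */
		} while (--inner);		/* sub $1, %ah; jnz */
	} while (--outer);			/* sub $1, %al; jnz */

	return body_runs;
}
```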

> As well as swapping %al <-> %ah try changing the outer loop decrement to
> sub $0x100, %ax
> since %al is zero that will set the z flag the same.

Unfortunately, using "sub $0x100, %ax" (with %al as the inner-loop counter)
is no better than just using "sub $1, %ah" in the outer loop:

Event                %al inner      + sub %ax      Delta
-------------------  -------------  -------------  ------
cycles                 776,775,020    813,372,036   +4.7%
instructions/cycle            1.23           1.17   -4.5%
branch-misses            4,792,502      7,610,323  +58.8%
uops_issued.any        768,019,010    827,465,137   +7.7%
time elapsed               0.1627s        0.1707s   +4.9%
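The "sub $0x100, %ax" trick relies on %al being zero at the outer
decrement, so subtracting 0x100 decrements %ah and leaves ZF set exactly
when "sub $1, %ah" would set it. A quick portable check of that
equivalence over all starting %ah values:

```c
#include <stdint.h>

/*
 * With %al == 0, "sub $0x100, %ax" and "sub $1, %ah" must agree on the
 * resulting %ah, must leave %al untouched, and must produce the same ZF
 * outcome (16-bit result zero vs. 8-bit result zero).  Returns 1 if the
 * equivalence holds for every starting value of %ah.
 */
static int sub_ax_matches_sub_ah(void)
{
	unsigned int ah;

	for (ah = 0; ah < 256; ah++) {
		uint16_t ax = (uint16_t)(ah << 8);	/* %al == 0 */
		uint16_t after_ax = (uint16_t)(ax - 0x100);
		uint8_t after_ah = (uint8_t)(ah - 1);

		/* same %ah result, %al still zero */
		if ((after_ax >> 8) != after_ah || (after_ax & 0xff) != 0)
			return 0;
		/* same ZF: 16-bit result zero iff 8-bit result zero */
		if ((after_ax == 0) != (after_ah == 0))
			return 0;
	}
	return 1;
}
```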

> I've just hacked a test into some test code I've got.
> I'm not seeing an unexpected costs on either zen-5 or haswell.
> So it may be more subtle.

This is puzzling, but at least it is clear that using %al for the inner
loop is the best option. In summary:

Variant  Cycles       Uops Issued  Branch Misses
-------  -----------  -----------  -------------
%al      776M         768M         4.8M           (fastest)
%ah      972M (+25%)  697M (-9%)   560K (-88%)    (fewer uops + misses, yet slowest)
sub %ax  813M (+5%)   827M (+8%)   7.6M (+59%)    (most uops + misses)