Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: David Laight
Date: Wed Apr 01 2026 - 05:08:57 EST
On Wed, 1 Apr 2026 01:12:36 -0700
Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
> On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> > On Fri, 27 Mar 2026 17:42:56 -0700
> > Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > > Borislav Petkov <bp@xxxxxxxxx> wrote:
> > > > >
> > > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > > >
> > > > > > Yes, you are.
> > > > > >
> > > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > > arguments.
> > > > > >
> > > > > > Basically what I did.
> > > > > >
> > > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > > separate.
> > > > > >
> > > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > > >
> > > > >
> > > > > I think someone tried getting C code to write the values to global data
> > > > > and getting the asm to read them.
> > > > > That got discounted because it split things between two largely unrelated files.
> > > >
> > > >
> > > > The implementation with global variables wasn't that bad, let me revive it.
> > > >
> > > > Although this part ties the sequence to the BHI mitigation, which is not
> > > > ideal (because VMSCAPE also uses it), it does seem the cleaner option.
> > > >
> > > > --- a/arch/x86/kernel/cpu/bugs.c
> > > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > > >
> > > > static void __init bhi_update_mitigation(void)
> > > > {
> > > > + if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > > + bhi_seq_outer_loop = 5;
> > > > + bhi_seq_inner_loop = 5;
> > > > + }
> > > > +
> > > >
> > > > I believe this can be moved to somewhere common to all mitigations.
> > > >
> > > > > I think the BPF code would need significant refactoring to call a C function.
> > > >
> > > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> > >
> > > While testing this approach, I noticed that syscalls were suffering an 8%
> > > regression on ICX for Native BHI mitigation:
> > >
> > > $ perf bench syscall basic -l 100000000
> > >
> > > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > > eax/ecx) as the main contributor to the regression. (Global variables added
> > > a bit, but within noise).
> > >
> > > Further digging revealed a strange behavior: using %ah for the inner loop
> > > was causing the regression, and interchanging %al and %ah between the two
> > > loops (for both the movb and the sub) eliminated it.
> > >
> > > <clear_bhb_loop_nofence>:
> > >
> > > movb bhb_seq_outer_loop(%rip), %al
> > >
> > > call 1f
> > > jmp 5f
> > > 1: call 2f
> > > .Lret1: RET
> > > 2: movb bhb_seq_inner_loop(%rip), %ah
> > > 3: jmp 4f
> > > nop
> > > 4: sub $1, %ah <---- No regression with %al here
> > > jnz 3b
> > > sub $1, %al
> > > jnz 1b
> > >
> > > My guess is that "sub $1, %al" is faster than "sub $1, %ah". Using %al in
> > > the inner loop, which executes many more times, is likely what makes the
> > > difference. A perf profile is needed to confirm this.
> >
> > I bet it is also CPU dependent - it is quite likely that there isn't
> > any special hardware to support partial writes of %ah, so they end up taking
> > a slow path (possibly even a microcoded one, to get an 8% regression).
>
> Strangely, %ah in the inner loop incurs fewer uops and has fewer branch
> misses, yet takes more cycles. Below is the perf data for the sequence on a
> Rocket Lake (similar observation on ICX and EMR):
>
> Event %al inner %ah inner Delta
> ---------------------- ------------- ------------- ----------
> cycles 776,775,020 972,322,384 +25.2%
> instructions/cycle 1.23 0.98 -20.3%
> branch-misses 4,792,502 560,449 -88.3%
> uops_issued.any 768,019,010 696,888,357 -9.3%
> time elapsed 0.1627s 0.2048s +25.9%
>
> Time elapsed directly correlates with the increase in cycles.
That might be consistent with the %ah accesses (probably writes)
being very slow/synchronising.
So you are getting a full cpu stall instead of speculative execution
of the following instructions - which must include a lot of mis-predicted
branches.
> > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > sub $0x100, %ax
> > since %al is zero that will set the z flag the same.
>
> Unfortunately, using "sub $0x100, %ax" (with %al in the inner loop) isn't better
> than just using "sub $1, %ah" in the outer loop:
>
> Event %al inner + sub %ax Delta
> ---------------------- ------------- ------------- ----------
> cycles 776,775,020 813,372,036 +4.7%
> instructions/cycle 1.23 1.17 -4.5%
> branch-misses 4,792,502 7,610,323 +58.8%
> uops_issued.any 768,019,010 827,465,137 +7.7%
> time elapsed 0.1627s 0.1707s +4.9%
That is even more interesting.
The 'sub %ax' version has more uops and more branch-misses.
Looks like the extra cost of the %ah access is less than the cost
of the extra mis-predicted branches.
Makes me wonder where a version that uses %cl fits?
(Or use a zero-extending read and %eax/%ecx - likely to be the same.)
I'll bet 'one beer' that it is nearest the 'sub %ax' version.
David
>
> > I've just hacked a test into some test code I've got.
> > I'm not seeing an unexpected costs on either zen-5 or haswell.
> > So it may be more subtle.
>
> This is puzzling, but at least it is evident that using %al for the inner
> loop seems to be the best option. In summary:
>
> Variant Cycles Uops Issued Branch Misses
> ------- ---------- ----------- -------------
> %al 776M 768M 4.8M (fastest)
> %ah 972M (+25%) 697M (-9%) 560K (-88%) (fewer uops + misses, yet slowest)
> sub %ax 813M (+5%) 827M (+8%) 7.6M (+59%) (most uops + misses)