Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta
Date: Wed Apr 01 2026 - 14:56:29 EST

On Wed, Apr 01, 2026 at 10:02:00AM +0100, David Laight wrote:
> > > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > > sub $0x100, %ax
> > > since %al is zero that will set the z flag the same.
> >
> > Unfortunately, using "sub $0x100, %ax"(with %al as inner loop) isn't better
> > than just using "sub $1, %ah" in the outer loop:
> >
> > Event                    %al inner      + sub %ax      Delta
> > ----------------------   -------------  -------------  ----------
> > cycles                   776,775,020    813,372,036    +4.7%
> > instructions/cycle       1.23           1.17           -4.5%
> > branch-misses            4,792,502      7,610,323      +58.8%
> > uops_issued.any          768,019,010    827,465,137    +7.7%
> > time elapsed             0.1627s        0.1707s        +4.9%
>
> That is even more interesting.
> The 'sub %ax' version has more uops and more branch-misses.
> Looks like the extra cost of the %ah access is less than the cost
> of the extra mis-predicted branches.
>
> Makes me wonder where a version that uses %cl fits?
> (Or use a zero-extending read and %eax/%ecx - likely to be the same.)
> I'll bet 'one beer' that is nearest the 'sub %ax' version.

%cl didn't make a noticeable difference, but ...

Event                    %al/%ah        %al/%cl        Delta
                         (inner/outer)  (inner/outer)
----------------------   -------------  -------------  ----------
cycles                   776,380,149    778,294,183    +0.2%
instructions/cycle       1.23           1.22           -0.4%
branch-misses            4,986,437      5,679,599      +13.9%
uops_issued.any          773,223,387    765,724,878    -1.0%
time elapsed             0.1631s        0.1637s        +0.4%

... there are meaningful gains with 32-bit registers:

Event                    %al/%ah        %eax/%ecx      Delta
                         (inner/outer)  (inner/outer)
----------------------   -------------  -------------  ----------
cycles                   776,380,149    706,331,177    -9.0%
instructions/cycle       1.23           1.35           +9.9%
branch-misses            4,986,437      6,089,306      +22.1%
uops_issued.any          773,223,387    774,539,522    +0.2%
time elapsed             0.1631s        0.1482s        -9.1%

These values are from userspace tests with immediates. Next, I will test how
they perform with memory loads in the kernel. Before we finalize, these uarch
nuances need to be tested on a variety of CPUs.
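
For anyone re-checking the numbers: the Delta column reads as the relative
change against the baseline (%al/%ah) column. Below is a minimal Python
sketch under that assumption, using the counters from the %eax/%ecx table.
pct_delta() is a hypothetical helper name, not part of perf or any other
tool. instructions/cycle is left out because the tables only show it to two
decimals, so its delta cannot be reproduced exactly from the printed values.

```python
def pct_delta(base, new):
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# Counters copied from the %al/%ah vs %eax/%ecx table above.
baseline = {
    "cycles": 776_380_149,
    "branch-misses": 4_986_437,
    "uops_issued.any": 773_223_387,
    "time elapsed (s)": 0.1631,
}
variant = {
    "cycles": 706_331_177,
    "branch-misses": 6_089_306,
    "uops_issued.any": 774_539_522,
    "time elapsed (s)": 0.1482,
}

# Assumed definition of the Delta column: (variant - baseline) / baseline.
deltas = {name: pct_delta(baseline[name], variant[name]) for name in baseline}

for name, delta in deltas.items():
    print(f"{name:<22} {delta:+.1f}%")
```

The same helper can be pointed at the other two tables by swapping in their
counters.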