Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns

From: David Laight
Date: Wed Mar 05 2025 - 15:14:40 EST


On Wed, 5 Mar 2025 07:04:08 -1000
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, 4 Mar 2025 at 22:54, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> >
> > Even to my surprise, the patch has some noticeable effects on the
> > performance, please see the attachment in [1] for LMBench data or [2]
> > for some excerpts from the data. So, I think the patch has potential
> > to improve the performance.
>
> I suspect some of the performance difference - which looks
> unexpectedly large - is due to having run them on a CPU with the
> horrendous indirect return costs, and then inlining can make a huge
> difference.
...

Another possibility is that the processes are getting bounced between
cpus in a slightly different way.
An idle cpu might be running at 800MHz, run something that spins on it
and the clock speed will soon jump to 4GHz.
But if your 'spinning' process is migrated to a different cpu it starts
again at 800MHz.
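One way to take that variable out of the measurement is to pin the benchmark
to a single cpu so the scheduler can't migrate it onto a core that is still
ramping up from its idle frequency. A minimal sketch (BENCH and CPU are
placeholders, not anything from the patch under discussion):

```shell
# Pin the measured process to one cpu with taskset (util-linux), so a
# migration can't drop it back onto a core still at its idle frequency.
BENCH=${BENCH:-true}   # placeholder workload; substitute the real benchmark
CPU=${CPU:-0}          # cpu to pin to
taskset -c "$CPU" "$BENCH"
echo "ran $BENCH pinned to cpu$CPU"
```

The same effect can be had from inside the program with
sched_setaffinity(2), but taskset needs no code changes.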

(I had something where an fpga compile went from 12 mins to over 20 because
the kernel RSB stuffing caused the scheduler to behave differently, even
though nothing was doing a lot of system calls.)

All sorts of things can affect that - possibly even making some code faster!

The (IIRC) 30k increase in code size will be a few functions being inlined.
bloat-o-meter should show which ones, and forcing the same functions inline
in the old build should reduce the difference.
OTOH I'm surprised that one (or two) instructions make that much
difference - unless gcc is managing to discard the size of the entire
function rather than just the asm block itself.

Benchmarking on modern cpus is hard.
You really do need to lock the cpu frequencies - and that may not be supported.
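Where it is supported, locking the frequency goes through the standard
cpufreq sysfs interface. A sketch (needs root and a driver that exposes
these files, so it defaults to a dry run that only prints the commands):

```shell
# Lock cpu frequency via the Linux cpufreq sysfs interface. DRY_RUN=1
# (the default here) only prints what would be written.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else sh -c "$*"; fi; }

# Switch every core to the fixed-frequency "performance" governor:
for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    run "echo performance > $gov"
done
# On intel_pstate, also disable turbo so the frequency really is fixed:
run "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```

Disabling turbo matters too: with it left on, even the "performance"
governor lets the clock wander with thermal and power headroom.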

David