Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking insns
From: Ingo Molnar
Date: Thu Mar 06 2025 - 05:00:31 EST
* Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> According to:
>
> https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html
>
> the use of asm pseudo directives in an asm template can cause the
> compiler to wrongly estimate the size of the generated code.
>
> The LOCK_PREFIX macro expands to several asm pseudo directives, so
> its use in atomic locking insns causes the instruction length
> estimate to be significantly off (a specially instrumented compiler
> reports the estimated length of these asm templates as 6
> instructions).
>
> This incorrect estimate in turn leads to suboptimal inlining
> decisions, suboptimal instruction scheduling and suboptimal code
> block alignment for functions that use these locking primitives.
>
> Instead of plain asm, use asm_inline:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html
>
> a feature that makes GCC pretend some inline assembler code is tiny
> (while it would otherwise think it is huge).
>
> For code size estimation, the size of such an asm statement is then
> taken to be the minimum size of one instruction, regardless of how
> many instructions the compiler thinks it contains.
>
> The code size of the resulting x86_64 defconfig object file increases
> by 33,264 bytes, representing a 0.12% code size increase:
>
>     text    data    bss      dec     hex filename
> 27450107 4633332 814148 32897587 1f5fa33 vmlinux-old.o
> 27483371 4633784 814148 32931303 1f67de7 vmlinux-new.o
>
> mainly due to different inlining decisions of the -O2 build.
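For reference, LOCK_PREFIX expands to roughly the following, abridged
from arch/x86/include/asm/alternative.h for the CONFIG_SMP case (the
exact form can vary between kernel versions). The pseudo directives
only record the prefix's address in the .smp_locks section, so that
the prefix can be patched out on UP machines, but they inflate the
compiler's size estimate of the asm statement:

#define LOCK_PREFIX_HERE \
	".pushsection .smp_locks,\"a\"\n" \
	".balign 4\n" \
	".long 671f - .\n" /* offset */ \
	".popsection\n" \
	"671:"

#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "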
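asm_inline itself is a thin wrapper; its definition in
include/linux/compiler_types.h is along these lines, with
CONFIG_CC_HAS_ASM_INLINE set when the toolchain supports the
"asm inline" qualifier (GCC 9+, Clang 11+):

#ifdef CONFIG_CC_HAS_ASM_INLINE
#define asm_inline asm __inline
#else
#define asm_inline asm
#endif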
So my request here is not for more benchmark figures (I don't think
it's realistic to expect contributors to be able to measure much of an
effect with this type of change, let alone be certain that what a
macro- or micro-benchmark measures is causally connected to the
patch), but rather for some qualitative analysis on the code
generation side:
- A +0.12% (33 kbyte) code size increase is a lot, especially as it
  comes under the default build flags of the kernel. Where does the
  extra code come from? (One way to probe this is sketched below.)
- Is there any effect on Clang? Are its inlining decisions around
  these asm() statements comparable, or worse/better?
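One way to probe both questions with a standalone test (a sketch with
made-up file and function names, not code from the patch): build the
same locked instruction with and without the "asm inline" qualifier
and compare the -O2 output of GCC and Clang. Per-function size deltas
of the full kernel builds can then be compared with
scripts/bloat-o-meter vmlinux-old.o vmlinux-new.o:

/* asm-size-test.c: hypothetical standalone test, not from the patch.
 * Build with "gcc -O2 -S asm-size-test.c" and
 * "clang -O2 -S asm-size-test.c" on x86-64 (needs GCC >= 9 or
 * Clang >= 11 for the "asm inline" qualifier) and compare the
 * generated assembly; the size-estimate difference shows up in
 * inlining decisions once the helpers are called from larger
 * functions.
 */
static inline void lock_inc(int *p)
{
	/* Mimics LOCK_PREFIX: the pseudo directives make the compiler
	 * estimate this single insn as several instructions. */
	asm volatile(".pushsection .smp_locks,\"a\"\n"
		     ".balign 4\n"
		     ".long 671f - .\n"
		     ".popsection\n"
		     "671:\n\tlock; incl %0" : "+m" (*p));
}

static inline void lock_inc_asm_inline(int *p)
{
	/* Same template with the "asm inline" qualifier: its size is
	 * estimated as the minimum size of one instruction. */
	asm inline volatile(".pushsection .smp_locks,\"a\"\n"
			    ".balign 4\n"
			    ".long 671f - .\n"
			    ".popsection\n"
			    "671:\n\tlock; incl %0" : "+m" (*p));
}

void caller_plain(int *p)      { lock_inc(p); }
void caller_asm_inline(int *p) { lock_inc_asm_inline(p); }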
A couple of concrete examples would go a long way:
- "Function XXX was inlined 3 times before the patch, and it was
inlined 30 times after the patch. I have reviewed two such inlining
locations, and they have added more code to unlikely or
failure-handling branches collected near the function epilogue,
while the fast-path of the function was more optimal."
Or you might end up finding:
- "Function YYY was inlined 3x more frequently after the patch, but
the inlining decision increased register pressure and created less
optimal code in the fast-path, increasing both code size and likely
decreasing fast-path performance."
Obviously we'd be sad about the second case, but it's well within the
spectrum of possibilities when we look at a "+0.12% object code size
increase".
What we cannot do is throw up our hands and claim "-O2 trades code
size for performance, and thus this patch improves performance".
We don't know that for sure, and 30 years of kernel development have
created a love-hate relationship and a fair level of distrust between
kernel developers and compiler inlining decisions, especially around
x86 asm() statements ...
So these are roughly the high-level requirements around such patches.
Does this make sense?
Thanks,
Ingo