Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()
From: H. Peter Anvin
Date: Mon Mar 17 2025 - 14:35:36 EST
On March 17, 2025 2:01:12 AM PDT, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>* Borislav Petkov <bp@xxxxxxxxx> wrote:
>
>> On Fri, Mar 14, 2025 at 10:30:55AM +0100, Uros Bizjak wrote:
>> > Use asm_inline() to instruct the compiler that the size of asm()
>> > is the minimum size of one instruction, ignoring how many instructions
>> > the compiler thinks it is. ALTERNATIVE macro that expands to several
>> > pseudo directives causes instruction length estimate to count
>> > more than 20 instructions.
>> >
>> > bloat-o-meter reports minimal code size increase
>>
>> If you see an increase and *no* *other* *palpable* improvement, you
>> don't send it. It is that simple.
>
>Sorry, but you wouldn't be saying that eliminating function calls is
>not a 'palpable improvement', had you ever profiled a recent kernel on
>a real system, on modern CPUs ... :-/
>
>The sad reality is that the top profile is dominated by function call +
>return overhead due to CPU bug mitigation workarounds that create per
>function call overhead:
>
>  Overhead  Shared Object  Symbol
>     4.57%  [kernel]       [k] retbleed_return_thunk   <============= !!!!!!!!
>     4.40%  [kernel]       [k] unmap_page_range
>     4.31%  [kernel]       [k] _copy_to_iter
>     2.46%  [kernel]       [k] memset_orig
>     2.31%  libc.so.6      [.] __cxa_finalize
>
>That retbleed_return_thunk overhead gets avoided every time we inline a
>simple enough function.
>
>But GCC cannot always do proper inlining decisions due to our
>complicated ALTERNATIVE macro constructs confusing the GCC inliner:
>
> > > ALTERNATIVE macro that expands to several pseudo directives causes
> > > instruction length estimate to count more than 20 instructions.
>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>Note how the asm_inline() compiler feature was added by GCC at the
>kernel community's request to address such issues. (!)
>
>So for those reasons, in my book, eliminating a function call for
>really simple single instruction inlines is an unconditional
>improvement that doesn't require futile performance measurements - it
>'only' requires assembly level code generation analysis in the
>changelog.
>
>The reason is that requiring measurable effects for really small
>inlining changes is pretty much impossible in practice. I know, because
>I tried, and I'm good at measuring such things and I have the hardware
>to do it. Yet the per function call overhead demonstrated above in the
>profile is very much real and should not be handwaved away.
>
>Note that this policy doesn't apply to other inlining decisions, only
>to single-instruction inline functions.
>
>Also, having said all that, for this particular patch I'd still like to
>see a bit more GCC code generation analysis in this particular
>changelog: could you please cite a single relevant, representative
>example before/after assembly code section that demonstrates the
>effects of the inlined asm versus function call version, including the
>function that gets called?
>
>I'm asking for that because sometimes single instructions can still
>have a halo of half a dozen of instructions that set them up or
>transform their results, so sometimes having a function call is the
>better option. Not all single-instruction asm() statements are 'simple'
>in praxis - but looking at the code generation will very much tell us
>whether it is.
>
>Thanks,
>
> Ingo

To repeat myself: I would like to see us at least try to #define asm __asm__ __inline__ tree-wide (with a possible opt-out) and benchmark the result. Since this is a central knob, we could even make it a Kconfig option that architectures can opt in to or out of, or that can be overridden for specific compilers should it ever be necessary.

That is simply much closer to how we actually use asm() in the Linux kernel, *and* to the performance characteristics we tend to care about. More often than not, when we have a large hunk of assembly source, its size comes from metadata and/or directives rather than from instructions.
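
For illustration only, a userspace sketch of what such a knob might look like. Everything prefixed demo_ or DEMO_ is hypothetical: DEMO_ASM_INLINE_OPT_OUT stands in for the Kconfig/opt-out switch, demo_asm stands in for the redefined asm keyword, and the compiler version check is a rough guess in place of a proper compile-time probe. The asm body is made of no-op alignment directives only, to mimic a statement whose source is mostly directives and not instructions.

```c
#include <assert.h>

/*
 * DEMO_ASM_INLINE_OPT_OUT is a hypothetical opt-out switch, standing in
 * for a Kconfig option. The version check is an assumption; a real build
 * system would more likely probe the compiler with a test compile.
 */
#if !defined(DEMO_ASM_INLINE_OPT_OUT) && \
    ((defined(__clang__) && __clang_major__ >= 11) || \
     (!defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 9))
# define demo_asm __asm__ __inline__	/* size estimate: minimal */
#else
# define demo_asm __asm__		/* size estimate: per line of source */
#endif

/*
 * The asm body below contains three directive lines and no instructions.
 * With plain __asm__ the inliner's size estimate grows with each line;
 * with the inline qualifier the statement is treated as minimal size.
 * The "+r" constraint just passes the value through a register.
 */
static inline unsigned long demo_passthrough(unsigned long v)
{
	demo_asm(".balign 1\n\t"
		 ".balign 1\n\t"
		 ".balign 1\n\t"
		 : "+r" (v));
	return v;
}

/* Usage: the value is unchanged whichever definition of demo_asm is picked. */
static inline void demo_self_check(void)
{
	assert(demo_passthrough(42UL) == 42UL);
}
```

The sketch uses a separate macro name because redefining the asm keyword itself in a userspace program that includes standard headers is not something I would rely on; tree-wide in the kernel the define would be on asm directly, as described above.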

It doesn't hurt that duplicating kernel code through inlining can occasionally bring about huge improvements in branch elimination, because often (but far from always, of course) the difference in call context allows the compiler to eliminate dead paths.
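
A made-up example of the dead-path effect, not taken from the kernel; demo_fill() and its caller are hypothetical names. The flag is a constant at the call site, so a compiler that inlines the helper there may drop the zeroing loop entirely, whereas an out-of-line copy has to keep the test and both paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: optionally zero the buffer, then mark its first
 * byte. Returns the number of byte stores performed.
 */
static inline size_t demo_fill(unsigned char *buf, size_t len, bool zero_first)
{
	size_t i, stores = 0;

	if (zero_first) {
		for (i = 0; i < len; i++)
			buf[i] = 0;
		stores += len;
	}
	if (len) {
		buf[0] = 0xff;
		stores++;
	}
	return stores;
}

/*
 * zero_first is a constant false in this call context. Once demo_fill()
 * is inlined here, the zeroing branch is dead code the compiler can remove.
 */
static inline size_t demo_caller_no_zero(unsigned char *buf, size_t len)
{
	return demo_fill(buf, len, false);
}

/* Usage: only the first byte is written when zeroing is not requested. */
static inline void demo_fill_self_check(void)
{
	unsigned char b[4] = { 1, 2, 3, 4 };

	assert(demo_caller_no_zero(b, sizeof(b)) == 1);
	assert(b[0] == 0xff && b[1] == 2);
}
```

Whether the branch is actually removed depends on the compiler and optimization level; looking at the generated assembly is the only way to be sure.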

-hpa