Re: [PATCH v2] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()
From: Ingo Molnar
Date: Mon Mar 17 2025 - 19:21:18 EST
* Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> Use asm_inline() to instruct the compiler that the size of asm()
> is the minimum size of one instruction, ignoring how many instructions
> the compiler thinks it is. The ALTERNATIVE macro, which expands to
> several pseudo directives, causes the instruction length estimate
> to count more than 20 instructions.
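
[ For anyone following along, here is a minimal user-space sketch of the
  size-estimate issue described above. The names demo_untagged_addr() and
  demo_caller() are made up, and the compiler version check is an assumed
  stand-in for the kernel's CONFIG_CC_HAS_ASM_INLINE test, so this is an
  illustration and not the kernel's code. GCC estimates the size of an
  asm() by counting statement separators in the template, so a template
  full of newlines looks like many instructions unless the inline
  qualifier is used. ]

```c
/*
 * Sketch only: same shape as the kernel's asm_inline fallback, but keyed
 * on the compiler version (an assumption) instead of a Kconfig symbol.
 */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 9)
# define asm_inline __asm__ __inline
#else
# define asm_inline __asm__
#endif

/*
 * Hypothetical stand-in for __untagged_addr(). The template contains no
 * real instructions, only newline separators, which GCC's size estimate
 * counts as if each one were an instruction. With the inline qualifier
 * the asm is instead costed at the minimum size for inlining decisions.
 */
static inline unsigned long demo_untagged_addr(unsigned long addr)
{
	asm_inline("\n\n\n\n\n\n\n\n" : "+r" (addr));
	return addr;
}

/* A caller whose own inlining cost depends on the estimate above. */
unsigned long demo_caller(unsigned long addr)
{
	return demo_untagged_addr(addr) + 1;
}
```
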
>
> bloat-o-meter reports minimal code size increase
> (x86_64 defconfig with CONFIG_ADDRESS_MASKING, gcc-14.2.1):
>
> add/remove: 2/2 grow/shrink: 5/1 up/down: 2365/-1995 (370)
>
> Function old new delta
> -----------------------------------------------------
> do_get_mempolicy - 1449 +1449
> copy_nodes_to_user - 226 +226
> __x64_sys_get_mempolicy 35 213 +178
> syscall_user_dispatch_set_config 157 332 +175
> __ia32_sys_get_mempolicy 31 206 +175
> set_syscall_user_dispatch 29 181 +152
> __do_sys_mremap 2073 2083 +10
> sp_insert 133 117 -16
> task_set_syscall_user_dispatch 172 - -172
> kernel_get_mempolicy 1807 - -1807
>
> Total: Before=21423151, After=21423521, chg +0.00%
>
> The code size increase is due to the compiler inlining
> more functions that inline untagged_addr(), e.g:
>
> task_set_syscall_user_dispatch() is now fully inlined in
> set_syscall_user_dispatch():
>
> 000000000010b7e0 <set_syscall_user_dispatch>:
> 10b7e0: f3 0f 1e fa endbr64
> 10b7e4: 49 89 c8 mov %rcx,%r8
> 10b7e7: 48 89 d1 mov %rdx,%rcx
> 10b7ea: 48 89 f2 mov %rsi,%rdx
> 10b7ed: 48 89 fe mov %rdi,%rsi
> 10b7f0: 65 48 8b 3d 00 00 00 mov %gs:0x0(%rip),%rdi
> 10b7f7: 00
> 10b7f8: e9 03 fe ff ff jmp 10b600 <task_set_syscall_user_dispatch>

So this was a tail call that jumped to a standalone
<task_set_syscall_user_dispatch>, right? With the patch we avoid the
tail jump and maybe some of the register shuffling for the parameters,
which would explain the bloat-o-meter delta of -172 for
task_set_syscall_user_dispatch?
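
To make the question concrete: the quoted disassembly is consistent with
a wrapper of roughly the following shape, where four arguments come in,
each one moves over by one argument register, and the current task goes
into the first slot before the jump. This is a user-space sketch with
made-up names (struct task, current_task and the demo_ functions are
stand-ins, not the kernel's definitions):

```c
#include <assert.h>
#include <stddef.h>

struct task { long id; };                  /* stand-in for struct task_struct */
static struct task current_task = { 1 };   /* stand-in for 'current' */

/* noinline keeps the callee out of line, as in the quoted disassembly. */
__attribute__((noinline))
static long demo_task_set_dispatch(struct task *t, unsigned long mode,
				   unsigned long offset, unsigned long len,
				   char *selector)
{
	assert(t != NULL);
	(void)selector;                    /* unused in this sketch */
	return (long)(mode + offset + len) + t->id;
}

/*
 * Wrapper: prepends the task pointer and forwards the rest. Because the
 * call is the last thing it does, the compiler may emit it as a jump
 * (a tail call) instead of call + ret.
 */
long demo_set_dispatch(unsigned long mode, unsigned long offset,
		       unsigned long len, char *selector)
{
	return demo_task_set_dispatch(&current_task, mode, offset, len, selector);
}
```

Building this with "gcc -O2 -S", once as written and once without the
noinline attribute, should show the two outcomes being discussed: a jmp
to the out-of-line callee versus the callee's body folded into the
wrapper.
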
Could you also cite the first relevant bits of <task_set_syscall_user_dispatch>?

I don't seem to be able to reproduce this inlining decision. My GCC
version is:
gcc version 14.2.0 (Ubuntu 14.2.0-4ubuntu2)
which is one patch version older than your 14.2.1.

I tried GCC 11, 12 and 13 as well, but none of them emitted this tail
call; all appear to inline <task_set_syscall_user_dispatch> into
<set_syscall_user_dispatch>. What am I missing?

Another question: where do the size increases in these functions come
from?
> __x64_sys_get_mempolicy 35 213 +178
> syscall_user_dispatch_set_config 157 332 +175
> __ia32_sys_get_mempolicy 31 206 +175
> set_syscall_user_dispatch 29 181 +152
(I have to ask because I cannot reproduce this with my toolchain, so I
cannot look at it myself.)
Thanks,
Ingo