Re: [PATCH v2] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()
From: Uros Bizjak
Date: Tue Mar 18 2025 - 08:35:50 EST
On Tue, Mar 18, 2025 at 12:21 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > Use asm_inline() to instruct the compiler that the size of the asm()
> > statement is the minimum size of one instruction, regardless of how
> > many instructions the compiler estimates it to contain. The
> > ALTERNATIVE macro, which expands to several pseudo directives, causes
> > the instruction length estimate to count more than 20 instructions.
> >
> > bloat-o-meter reports minimal code size increase
> > (x86_64 defconfig with CONFIG_ADDRESS_MASKING, gcc-14.2.1):
> >
> > add/remove: 2/2 grow/shrink: 5/1 up/down: 2365/-1995 (370)
> >
> > Function old new delta
> > -----------------------------------------------------
> > do_get_mempolicy - 1449 +1449
> > copy_nodes_to_user - 226 +226
> > __x64_sys_get_mempolicy 35 213 +178
> > syscall_user_dispatch_set_config 157 332 +175
> > __ia32_sys_get_mempolicy 31 206 +175
> > set_syscall_user_dispatch 29 181 +152
> > __do_sys_mremap 2073 2083 +10
> > sp_insert 133 117 -16
> > task_set_syscall_user_dispatch 172 - -172
> > kernel_get_mempolicy 1807 - -1807
> >
> > Total: Before=21423151, After=21423521, chg +0.00%
> >
> > The code size increase is due to the compiler inlining more
> > functions that themselves inline untagged_addr(), e.g.:
> >
> > task_set_syscall_user_dispatch() is now fully inlined in
> > set_syscall_user_dispatch():
> >
> > 000000000010b7e0 <set_syscall_user_dispatch>:
> > 10b7e0: f3 0f 1e fa endbr64
> > 10b7e4: 49 89 c8 mov %rcx,%r8
> > 10b7e7: 48 89 d1 mov %rdx,%rcx
> > 10b7ea: 48 89 f2 mov %rsi,%rdx
> > 10b7ed: 48 89 fe mov %rdi,%rsi
> > 10b7f0: 65 48 8b 3d 00 00 00 mov %gs:0x0(%rip),%rdi
> > 10b7f7: 00
> > 10b7f8: e9 03 fe ff ff jmp 10b600 <task_set_syscall_user_dispatch>
>
> So this was a tail-call optimization that jumped to a standalone
> <task_set_syscall_user_dispatch>, right? So now we'll avoid the
> tail-jump and maybe a bit of the register parameter shuffling? Which
> would explain the bloat-o-meter delta of -172 for
> task_set_syscall_user_dispatch?
Yes, this is correct. Register shuffling is part of the call setup,
because parameters have to be passed to the called function in
specific registers, as mandated by the psABI. Inlining avoids this
because the parameters can then be freely allocated to any register of
the required class.
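As a minimal sketch (hypothetical function names, not the kernel
code), a wrapper that merely prepends an argument before a tail call
has to shuffle every incoming value into the argument register the
psABI assigns to its new position, which is exactly the mov sequence
seen in set_syscall_user_dispatch above:

/*
 * Under the x86-64 psABI the first integer arguments arrive in %rdi,
 * %rsi, %rdx, %rcx, %r8. Prepending 'task' shifts every argument one
 * slot to the right, so a sibling call must move %rcx->%r8,
 * %rdx->%rcx, %rsi->%rdx and %rdi->%rsi before jumping. Once callee()
 * is inlined, the register allocator can keep the values wherever
 * they already live.
 */
struct task;

extern struct task *get_current_task(void);  /* stand-in for 'current' */
extern long callee(struct task *task, unsigned long mode,
		   unsigned long offset, unsigned long len, char *sel);

long caller(unsigned long mode, unsigned long offset,
	    unsigned long len, char *sel)
{
	return callee(get_current_task(), mode, offset, len, sel);
}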
> Could you also cite the first relevant bits of <task_set_syscall_user_dispatch>?
000000000010b600 <task_set_syscall_user_dispatch>:
10b600: 48 89 f8 mov %rdi,%rax
10b603: 48 85 f6 test %rsi,%rsi
10b606: 74 54 je 10b65c <task_set_syscall_user_dispatch+0x5c>
10b608: 48 83 fe 01 cmp $0x1,%rsi
10b60c: 74 06 je 10b614 <task_set_syscall_user_dispatch+0x14>
10b60e: b8 ea ff ff ff mov $0xffffffea,%eax
10b613: c3 ret
10b614: 48 85 d2 test %rdx,%rdx
10b617: 75 7b jne 10b694 <task_set_syscall_user_dispatch+0x94>
10b619: 4d 85 c0 test %r8,%r8
10b61c: 74 1a je 10b638 <task_set_syscall_user_dispatch+0x38>
10b61e: 4c 89 c6 mov %r8,%rsi
10b621: 48 bf ef cd ab 89 67 movabs $0x123456789abcdef,%rdi
10b628: 45 23 01
10b62b: 90 nop
10b62c: 90 nop
10b62d: 90 nop
10b62e: 90 nop
10b62f: 90 nop
10b630: 90 nop
10b631: 90 nop
10b632: 90 nop
10b633: 48 39 f7 cmp %rsi,%rdi
10b636: 72 6e jb 10b6a6 <task_set_syscall_user_dispatch+0xa6>
10b638: 4c 89 80 48 08 00 00 mov %r8,0x848(%rax)
10b63f: 48 89 90 50 08 00 00 mov %rdx,0x850(%rax)
10b646: 48 89 88 58 08 00 00 mov %rcx,0x858(%rax)
10b64d: c6 80 60 08 00 00 00 movb $0x0,0x860(%rax)
10b654: f0 80 48 08 20 lock orb $0x20,0x8(%rax)
10b659: 31 c0 xor %eax,%eax
10b65b: c3 ret
10b65c: 49 09 c8 or %rcx,%r8
10b65f: 49 09 d0 or %rdx,%r8
10b662: 75 aa jne 10b60e <task_set_syscall_user_dispatch+0xe>
10b664: 48 c7 87 48 08 00 00 movq $0x0,0x848(%rdi)
10b66b: 00 00 00 00
10b66f: 48 c7 87 50 08 00 00 movq $0x0,0x850(%rdi)
10b676: 00 00 00 00
10b67a: 48 c7 87 58 08 00 00 movq $0x0,0x858(%rdi)
10b681: 00 00 00 00
10b685: c6 87 60 08 00 00 00 movb $0x0,0x860(%rdi)
10b68c: f0 80 67 08 df lock andb $0xdf,0x8(%rdi)
10b691: 31 c0 xor %eax,%eax
10b693: c3 ret
10b694: 48 8d 34 0a lea (%rdx,%rcx,1),%rsi
10b698: 48 39 f2 cmp %rsi,%rdx
10b69b: 0f 82 78 ff ff ff jb 10b619 <task_set_syscall_user_dispatch+0x19>
10b6a1: e9 68 ff ff ff jmp 10b60e <task_set_syscall_user_dispatch+0xe>
10b6a6: b8 f2 ff ff ff mov $0xfffffff2,%eax
10b6ab: c3 ret
> I don't seem to be able to reproduce this inlining decision, my version
> of GCC is:
>
> gcc version 14.2.0 (Ubuntu 14.2.0-4ubuntu2)
>
> which is one patch version older than your 14.2.1.
>
> I tried GCC 11, 12 and 13 as well, but none did this tail-call optimization;
> all appear to be inlining <task_set_syscall_user_dispatch> into
> <set_syscall_user_dispatch>. What am I missing?
CONFIG_ADDRESS_MASKING in current -tip depends on:

  X86_64 [=y] && (COMPILE_TEST [=n] || !CPU_MITIGATIONS [=y])

and I chose to disable CPU_MITIGATIONS (which is why no __pfx
functions are reported and why the compiler is able to perform
sibling-call optimization).
BTW: I found the cited functions a nice illustration of how the
compiler inlines bigger functions into smaller ones (one would expect
the opposite), as long as the code-size growth criteria for the
combined function are still respected.
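A contrived sketch of the effect (hypothetical functions): when the
caller is only argument set-up plus a call, inlining even a much
larger callee grows the combined function by little more than the
callee's own size, so the inliner's growth limits are still met:

/*
 * Hypothetical illustration: tiny_wrapper() is just one call, so
 * inlining big_worker() into it costs roughly the callee's size minus
 * the call overhead, a small relative growth for the combined
 * function.
 */
static long big_worker(long a, long b, long c)
{
	/* stands in for a few hundred instructions */
	return a * b + c;
}

long tiny_wrapper(long b, long c)
{
	return big_worker(0, b, c);	/* the whole body is one call */
}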
> Another question, where do the size increases in these functions come
> from:
>
> > __x64_sys_get_mempolicy 35 213 +178
This one now inlines kernel_get_mempolicy().
> > syscall_user_dispatch_set_config 157 332 +175
This one now inlines task_set_syscall_user_dispatch().
> > __ia32_sys_get_mempolicy 31 206 +175
This one now inlines kernel_get_mempolicy().
> > set_syscall_user_dispatch 29 181 +152
This one now inlines task_set_syscall_user_dispatch().
> (I have to ask, because I have trouble reproducing with my toolchain so
> I cannot look at this myself.)
BTW: an indication of inlining changes w.r.t. __untagged_addr() is the
number of references to the tlbstate_untag_mask variable in the
.altinstr_replacement section: if these go up, more inlining has
happened. Also, bloat-o-meter indicates which functions changed, and
looking for a run of 8 NOPs in such a function shows where the AND
instruction is to be patched in.
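For reference, a minimal sketch of the pattern under discussion: this
paraphrases, rather than quotes, the kernel's __untagged_addr() (the
real code applies the mask via a %gs-relative reference to the
per-CPU tlbstate_untag_mask, which is what makes the replacement, and
hence the NOP padding, 8 bytes long):

/*
 * Simplified sketch, assuming the kernel's ALTERNATIVE() and
 * X86_FEATURE_LAM definitions (<asm/alternative.h>,
 * <asm/cpufeatures.h>) and asm_inline from <linux/compiler_types.h>.
 * The empty first alternative emits NOP padding in .text; the AND
 * lives in .altinstr_replacement and is patched in at boot on
 * LAM-capable CPUs. asm_inline() makes GCC size the whole statement
 * as a single instruction instead of counting the macro's pseudo
 * directives as 20+ instructions.
 */
static inline unsigned long untag_addr_sketch(unsigned long addr,
					      unsigned long mask)
{
	asm_inline (ALTERNATIVE("",
				"and %[mask], %[addr]",
				X86_FEATURE_LAM)
		    : [addr] "+r" (addr)
		    : [mask] "r" (mask));
	return addr;
}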
Uros.