Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option

From: Ingo Molnar
Date: Tue Mar 25 2025 - 17:56:46 EST



* Uros Bizjak <ubizjak@xxxxxxxxx> wrote:

> Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> option. The compiler is not bound by ABI when emitting the instruction
> without the fallback call to __sw_hweight{32,64}() library function
> and has much more freedom to allocate input and output operands,
> including memory input operand.
>
> The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> shrinks by 599 bytes:
>
> add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
> Total: Before=22710531, After=22709932, chg -0.00%
>
> The asm changes from e.g.:
>
>    3bf9c:   48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi
>    3bfa3:   e8 00 00 00 00          call   3bfa8 <...>
>    3bfa8:   90                      nop
>    3bfa9:   90                      nop
>
> with:
>
>      34b:   31 c0                   xor    %eax,%eax
>      34d:   f3 48 0f b8 c7          popcnt %rdi,%rax
>
> in the .altinstr_replacement section
>
> to:
>
>    3bfdc:   31 c0                   xor    %eax,%eax
>    3bfde:   f3 48 0f b8 05 00 00    popcnt 0x0(%rip),%rax
>    3bfe5:   00 00
>
> where there is no need for an entry in the .altinstr_replacement
> section, shrinking all text sections by 9476 bytes:
>
>     text     data     bss       dec      hex  filename
> 27267068  4643047  814852  32724967  1f357e7  vmlinux-old.o
> 27257592  4643047  814852  32715491  1f332e3  vmlinux-new.o

> +#ifdef __POPCNT__
> +        asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +                    : [cnt] "=&r" (res)
> +                    : [val] ASM_INPUT_RM (w));
> +#else
>          asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>                                  "call __sw_hweight32",
>                                  ASM_CLR "popcntl %[val], %[cnt]",
>                                  X86_FEATURE_POPCNT)
>                           : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>                           : [val] REG_IN (w));

So I think a better optimization would be to declare and implement
__sw_hweight32() with a different, less intrusive function call ABI
that in essence mirrors that of the instruction, so that we optimize
for the overwhelmingly common case of having the POPCNT instruction.
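
For reference, here is a rough user-space sketch of the compile-time
selection the quoted hunk makes with #ifdef __POPCNT__. It is only an
illustration, not the kernel code: sw_hweight32() and hweight32_sketch()
are hypothetical names, __builtin_popcount() stands in for the inline
asm, and the fallback is a generic parallel bit count. It does not
attempt the custom call ABI suggested above.

```c
#include <stdint.h>

/*
 * Hypothetical software fallback: classic parallel bit count.
 * Each step sums adjacent bit groups (1, 2, then 4 bits wide).
 */
static inline unsigned int sw_hweight32(uint32_t w)
{
	w -= (w >> 1) & 0x55555555u;                      /* 2-bit sums */
	w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u); /* 4-bit sums */
	w = (w + (w >> 4)) & 0x0f0f0f0fu;                 /* 8-bit sums */
	return (w * 0x01010101u) >> 24;                   /* add the four bytes */
}

/*
 * Hypothetical wrapper: with -mpopcnt (or a -march= that implies it)
 * the compiler defines __POPCNT__ and can emit a bare POPCNT with
 * operands of its own choosing, including a memory input operand.
 * Otherwise fall back to the software version.
 */
static inline unsigned int hweight32_sketch(uint32_t w)
{
#ifdef __POPCNT__
	return (unsigned int)__builtin_popcount(w);
#else
	return sw_hweight32(w);
#endif
}
```

Building this with and without -mpopcnt and comparing the disassembly
of a caller should show the same kind of difference as in the
changelog: a bare popcnt in one case, the bit-twiddling fallback in
the other.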

Thanks,

Ingo