Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option

From: Ingo Molnar
Date: Sun Mar 30 2025 - 14:54:27 EST



* Uros Bizjak <ubizjak@xxxxxxxxx> wrote:

> On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> >
> > * Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> >
> > > > So a better optimization I think would be to declare and implement
> > > > __sw_hweight32 with a different, less intrusive function call ABI
> > > > that
> > >
> > > With an external function, the ABI specifies the location of input
> > > argument and function result.
> >
> > This is all within the kernel, and __sw_hweight32() is implemented in
> > the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros
> > are fully under our control as well - so we have full control over the
> > calling convention.
>
> There is a minor issue with a generic prototype in <linux/bitops.h>,
> where we declare:
>
> extern unsigned int __sw_hweight32(unsigned int w);
> extern unsigned long __sw_hweight64(__u64 w);
>
> This creates a bit of mixup, so perhaps it is better to define and use
> an x86 specific function name.

Yes, I alluded to this complication:

> > For example, we could make a version of __sw_hweight32 that is a
> > largely no-clobber function that only touches a single register, which

That version of __sw_hweight32 would be a different symbol.
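
For reference, a minimal C sketch of what such a separately named
fallback computes. The name x86_sw_hweight32 is hypothetical, chosen
only to avoid clashing with the generic prototype quoted above; the
real x86 fallback is written in assembly, and this sketch says nothing
about its register usage:

```c
#include <stdint.h>

/*
 * Hypothetical x86-specific symbol, illustration only: a branch-free
 * software population count using the usual parallel bit-summing trick.
 */
static inline unsigned int x86_sw_hweight32(uint32_t w)
{
	w -= (w >> 1) & 0x55555555u;                        /* 2-bit sums */
	w  = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);  /* 4-bit sums */
	w  = (w + (w >> 4)) & 0x0f0f0f0fu;                  /* 8-bit sums */
	return (w * 0x01010101u) >> 24;                     /* add the four bytes */
}
```

A distinct name would let the x86 version choose its own calling
convention without touching the generic declaration.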

> > I'm not saying it's *worth* it for POPCNTL emulation alone:
> >
> > - The code generation benefits might or might not be there. Needs to
> > be examined.
>
> Matching inputs with output will actually make the instruction
> "destructive", so the compiler will have to copy the input argument
> when it won't die in the instruction. This is not desirable.

Yeah, absolutely - it was mainly a demonstration that even
single-clobber functions are possible. (There are even zero-clobber
functions, like __fentry__.)

> I think that adding a __POPCNT__ version (similar to my original
> patch) would bring the most benefit, because we could use "rm" input
> and "=r" output registers, without any constraints, enforced by
> fallback function call. This is only possible with a new
> -march=native functionality.

Yeah, -march=native might be nice for local tinkering, but it won't
reach 99.999% of Linux users - so it's immaterial to this particular
discussion.
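
For concreteness, the "rm" input / "=r" output variant described above
could look roughly like the sketch below. This is an illustration under
assumptions, not the posted patch: hweight32_native is a hypothetical
name, and the non-POPCNT branch uses the compiler builtin as a
stand-in for the kernel's ALTERNATIVE() call to the software fallback:

```c
#include <stdint.h>

/* Hypothetical helper name, illustration only. */
static inline unsigned int hweight32_native(uint32_t w)
{
#if defined(__POPCNT__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int res;

	/*
	 * The compiler is free to pick any output register and to take
	 * the input from a register or from memory, since no function
	 * call fallback pins the operands to fixed registers.
	 */
	__asm__("popcntl %1, %0" : "=r" (res) : "rm" (w) : "cc");
	return res;
#else
	/* Stand-in fallback for builds without -mpopcnt / -march=native. */
	return (unsigned int)__builtin_popcount(w);
#endif
}
```

The asm branch is only compiled when the build itself enables POPCNT
(for example via -march=native on a capable CPU), which is exactly why
it does not help distro kernels.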

Also, is POPCNTL the best example for this? Are there no other, more
frequently used ALTERNATIVE() patching sites with function call
alternatives that disturb the register state of important kernel
functions? (And I don't know the answer.)

Thanks,

Ingo