Re: [RFC PATCH 2/2] treewide, bits: use ffs_val() where it is open-coded

From: Maciej W. Rozycki

Date: Mon Jan 12 2026 - 06:22:01 EST


On Sun, 11 Jan 2026, David Laight wrote:

> > If you have a population count instruction such as CTPOP on EV67 Alpha,
> > you can do it in two instructions, such as:
> >
> > ctpop $16, $0
> > cmpeq $0, 1, $0
> >
> > so there seems to be room for improvement here, but GCC stubbornly refuses
> > to produce this instruction sequence for this code:
> >
> > bool is_power_of_2(unsigned long n)
> > {
> > return __builtin_popcountl(n) == 1;
> > }
> >
> > apparently owing to no insn cost set for the POPCOUNT RTX in the backend
> > causing the middle end to pick up some default. Instead GCC comes up with
> > what can be coded in C as:
> >
> > bool is_power_of_2(unsigned long n)
> > {
> > return n - 1 < (n ^ (n - 1));
> > }
>
> Not seen that one - and not thought of popcnt, far too new for me.
> (I learnt asm for a pdp11 first...)

It does show up as an RTL operation in the compiler; it's a good way to
discover stuff.

> > which seems an interesting alternative that produces 3 instructions only
> > across Alpha, MIPS and RISC-V targets. It's 5 instructions on POWER and
> > x86-64 vs 8 and 9 respectively with our current implementation. There are
> > no branches produced here, although an inline expansion will likely end
> > with one.
>
> I'm seeing:
> is_power_of_2:
> leaq -1(%rdi), %rax
> xorq %rax, %rdi
> cmpq %rdi, %rax
> setb %al
> retq
> on x86-64 - which is really 3 instructions.

Ack, this is `unsigned long' vs `bool' return type. I'm just too used to
RISC psABIs. FWIW I have this:

leaq -1(%rdi), %rax
xorq %rax, %rdi
cmpq %rdi, %rax
setb %al
movzbl %al, %eax
ret

in the former case and I take it `bool' is 8-bit on x86.

> While using the popcnt instruction would reduce it by one, it will only
> be faster on amd zen cpu, on intel cpu popcnt has a latency of 3 and only
> executes on p1 (according to Agner).

Ack. I've lost track of x86 after ~Pentium Pro.

> > > I suspect there may not be one, but all these 'tricks' come from the 1970s
> > > (or earlier) and don't allow for the behaviour of modern cpu with multiple
> > > execution units and long pipelines.
> >
> > As we can see you never know. I've sent a code update proposal.
>
> I'd suggest moving is_power_of_2() to bits.h as:
> #define is_power_of_2(n) ({ auto _n = n; _n - 1 < _n ^ (_n - 1); })
> and add:
> #define is_zero_or_power_of_2(n) ({ auto _n = n; _n & (_n - 1) != 0; })

I'll leave it as an exercise for someone else.

> > The use of the population count operation can be considered separately,
> > but we'd have to be careful here as the quality of code produced by the
> > intrinsic varies, e.g. with pre-EV67 Alpha, RISC-V and VAX (!) GCC emits
> > code equivalent to my proposal, but with MIPS or x86-64 it resorts to a
> > libcall, so a guarding condition would be required. So I'd leave it for
> > another occasion.
>
> I'd guess that many of those popcnt instructions aren't actually as fast
> as the dec/xor pair.

With EV67 Alpha and POWER being proper RISC architectures I'd expect the
instruction to be implemented using a dedicated logic gate network. After
all the population count operation is just an n-operand addition of 1-bit
inputs, so not a big deal. The propagation delay is probably on the same
order of magnitude as with a 2-operand n-bit adder, so the operation ought
to fit in a single pipeline stage.

I suppose it's just a matter of whether you want to dedicate a piece of
silicon just for this somewhat exotic operation, and in the old days the
answer would be no for a general-purpose CPU owing to manufacturing cost
and die size limitation, while nowadays with process shrinkage and power
consumption reduction it may have become worthwhile.

> But the libcall is just plain stupid.

It could just be that it's a recent addition as my compilation of GCC for
x86-64 and MIPS turns out to be version 11 and 13 respectively, while I
have version 15 for the remaining targets. Indeed when I tried now GCC 14
that I have at hand for another MIPS configuration the libcall is gone, so
the optimisation must have been added in that version.

Of course you still need a libcall for a plain `__builtin_popcountl' call
where there's no suitable hardware instruction available.

Maciej