Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn

From: Uros Bizjak
Date: Wed Mar 26 2025 - 04:46:37 EST


On Tue, Mar 25, 2025 at 10:43 PM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > On Haswell and later Intel processors, the TZCNT instruction appears
> > to have a false dependency on the destination register. Even though
> > the instruction only writes to it, the instruction will wait until
> > destination is ready before executing. This false dependency
> > was fixed for Skylake (and later) processors.
> >
> > Fix false dependency by clearing the destination register first.
> >
> > The x86_64 defconfig object size increases by 4215 bytes:
> >
> > text data bss dec hex filename
> > 27342396 4642999 814852 32800247 1f47df7 vmlinux-old.o
> > 27346611 4643015 814852 32804478 1f48e7e vmlinux-new.o
>
> Yeah, so Skylake was released in 2015, about a decade ago.
>
> So we'd be making the kernel larger for an unquantified
> micro-optimization for CPUs that almost nobody uses anymore.
> That's a bad trade-off.

Yes, 4.2k seems a bit excessive. OTOH, I'd not say that the issue is a
micro-optimization, it is bordering on the hardware bug.

An alternative (also answering Peter's question) would be to make
instruction dependent only on the input (so, matching input and output
operand, making insn destructive), requiring MOV if the input operand
is to be reused. A quick test (x86_64 defconfig) with the attached
patch shows much less impact:

text data bss dec hex filename
27564120 4642423 814852 33021395 1f7ddd3 vmlinux-old.o
27564679 4642423 814852 33021954 1f7e002 vmlinux-new.o

The increase is now 559 bytes, which IMHO is an acceptable trade-off.
Please also note that:

asm("tzcnt %1,%0"
: "=r" (word)
: "r" (~word));
return word;

emits NOT insn that is also destructive and already requires copying
of the input operand if the operand is to be reused.

Thanks,
Uros.
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index bbaf75ea6703..378194687bec 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -250,7 +250,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
{
asm("tzcnt %1,%0"
: "=r" (word)
- : ASM_INPUT_RM (word));
+ : "0" (word)); /* avoid false output dependency */
return word;
}

@@ -267,10 +267,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)

static __always_inline unsigned long variable_ffz(unsigned long word)
{
- asm("tzcnt %1,%0"
- : "=r" (word)
- : "r" (~word));
- return word;
+ return variable__ffs(~word);
}

/**