Re: [PATCH v2] x86/crc32: use builtins to improve code generation

From: Bill Wendling
Date: Mon Mar 03 2025 - 15:27:51 EST


On Mon, Mar 3, 2025 at 12:15 PM David Laight
<david.laight.linux@xxxxxxxxx> wrote:
> On Thu, 27 Feb 2025 15:47:03 -0800
> Bill Wendling <morbo@xxxxxxxxxx> wrote:
>
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
>
> How much does it unroll?
> How much you need depends on the latency of the crc32 instruction.
> The copy of Agner's tables I have gives it a latency of 3 on
> pretty much everything.
> If you can only do one chained crc instruction every three clocks
> it is hard to see how unrolling the loop will help.
> Intel cpu (since sandy bridge) will run a two clock loop.
> With three clocks to play with it should be easy (even for a compiler)
> to generate a loop with no extra clock stalls.
>
> Clearly if Clang decides to copy arguments to the stack an extra time
> that will kill things. But in this case you want the "m" constraint
> to directly read from the buffer (with a (reg,reg,8) addressing mode).
>
Below is what Clang generates with the builtins. It unrolls the main
8-byte loop eight times (.LBB1_13). From what Eric said, this code is
only run for sizes <= 512 bytes? If so, it may not be worth
micro-optimizing. I apologize, but I'm not well equipped to measure
cycle counts for x86 code. (I'm sure I lack the requisite benchmarks,
etc.)

-bw

.LBB1_9:                        # =>This Inner Loop Header: Depth=1
        movl %ebx, %ebx
        crc32q (%rcx), %rbx
        addq $8, %rcx
        incq %rdi
        cmpq %rdi, %rsi
        jne .LBB1_9
# %bb.10:
        subq %rdi, %rax
        jmp .LBB1_11
.LBB1_7:
        movq %r14, %rcx
.LBB1_11:
        movq %r15, %rsi
        andq $-8, %rsi
        cmpq $7, %rdx
        jb .LBB1_14
# %bb.12:
        xorl %edx, %edx
.LBB1_13:                       # =>This Inner Loop Header: Depth=1
        movl %ebx, %ebx
        crc32q (%rcx,%rdx,8), %rbx
        crc32q 8(%rcx,%rdx,8), %rbx
        crc32q 16(%rcx,%rdx,8), %rbx
        crc32q 24(%rcx,%rdx,8), %rbx
        crc32q 32(%rcx,%rdx,8), %rbx
        crc32q 40(%rcx,%rdx,8), %rbx
        crc32q 48(%rcx,%rdx,8), %rbx
        crc32q 56(%rcx,%rdx,8), %rbx
        addq $8, %rdx
        cmpq %rdx, %rax
        jne .LBB1_13
.LBB1_14:
        addq %rsi, %r14
.LBB1_15:
        andq $7, %r15
        je .LBB1_23
# %bb.16:
        crc32b (%r14), %ebx
        cmpl $1, %r15d
        je .LBB1_23
# %bb.17:
        crc32b 1(%r14), %ebx
        cmpl $2, %r15d
        je .LBB1_23
# %bb.18:
        crc32b 2(%r14), %ebx
        cmpl $3, %r15d
        je .LBB1_23
# %bb.19:
        crc32b 3(%r14), %ebx
        cmpl $4, %r15d
        je .LBB1_23
# %bb.20:
        crc32b 4(%r14), %ebx
        cmpl $5, %r15d
        je .LBB1_23
# %bb.21:
        crc32b 5(%r14), %ebx
        cmpl $6, %r15d
        je .LBB1_23
# %bb.22:
        crc32b 6(%r14), %ebx
.LBB1_23:
        movl %ebx, %eax
.LBB1_24:
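
For readers following along, here is a rough C sketch of the kind of
builtin-based loop that would produce a listing shaped like the one
above: an 8-bytes-at-a-time loop (crc32q) followed by a byte tail of at
most seven steps (crc32b). This is not the code from the patch. The
names crc32c_update, crc32c_sw, crc32c_hw and HAVE_X86_CRC32 are made
up for illustration, and the runtime check plus bitwise fallback exist
only so the sketch also runs on machines without SSE4.2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bitwise CRC32C (reflected Castagnoli polynomial), used as a
 * fallback and as a reference to compare the hardware path against. */
uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
    }
    return crc;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_CRC32 1   /* hypothetical macro, for this sketch only */

/* Hardware path: the compiler is free to schedule and unroll these
 * builtin calls, which it cannot do with opaque inline asm. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t len)
{
    uint64_t c = crc;               /* crc32q works on a 64-bit register */

    for (; len >= 8; p += 8, len -= 8) {
        uint64_t v;

        memcpy(&v, p, sizeof(v));   /* unaligned-safe 8-byte load */
        c = __builtin_ia32_crc32di(c, v);
    }
    crc = (uint32_t)c;
    for (; len; p++, len--)         /* at most 7 trailing bytes */
        crc = __builtin_ia32_crc32qi(crc, *p);
    return crc;
}
#else
#define HAVE_X86_CRC32 0
#endif

/* Raw CRC32C update (no pre/post inversion), like the instruction. */
uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
#if HAVE_X86_CRC32
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, buf, len);
#endif
    return crc32c_sw(crc, buf, len);
}
```

On the unrolling question: every crc32q in .LBB1_13 depends on the
previous one through %rbx, so taking the latency of 3 quoted above, the
eight instructions form a chain of roughly 24 cycles per 64 bytes. The
unrolling therefore only saves loop overhead, which per David's
argument would probably fit inside the latency anyway.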