Re: [PATCH v2] x86/crc32: use builtins to improve code generation

From: H. Peter Anvin
Date: Mon Mar 03 2025 - 18:58:44 EST


On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>On Mon, 3 Mar 2025 12:27:21 -0800
>Bill Wendling <morbo@xxxxxxxxxx> wrote:
>
>> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> <david.laight.linux@xxxxxxxxx> wrote:
>> > On Thu, 27 Feb 2025 15:47:03 -0800
>> > Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >
>> > > For both gcc and clang, crc32 builtins generate better code than the
>> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> > > constraint issue.
>> > >
>> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> > > is expected because of the "rm" issue. However, Clang's performance is
>> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
>> >
>> > How much does it unroll?
>> > How much you need depends on the latency of the crc32 instruction.
>> > The copy of Agner's tables I have gives it a latency of 3 on
>> > pretty much everything.
>> > If you can only do one chained crc instruction every three clocks
>> > it is hard to see how unrolling the loop will help.
>> > Intel cpu (since sandy bridge) will run a two clock loop.
>> > With three clocks to play with it should be easy (even for a compiler)
>> > to generate a loop with no extra clock stalls.
>> >
>> > Clearly if Clang decides to copy arguments to the stack an extra time
>> > that will kill things. But in this case you want the "m" constraint
>> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >
>> Below is what Clang generates with the builtins. From what Eric said,
>> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> important to micro-optimize this. I apologize, but my ability to
>> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> requisite benchmarks, etc.)
>
>Jeepers - that is trashing the I-cache.
>Not to mention all the conditional branches at the bottom.
>Consider the basic loop:
>1: crc32q (%rcx), %rbx
> addq $8, %rcx
> cmp %rcx, %rdx
> jne 1b
>The crc32 has latency 3 so it must take at least 3 clocks.
>Even naively the addq can be issued in the same clock as the crc32
>and the cmp and jne in the following ones.
>Since the jne is predicted taken, the addq can be assumed to execute
>in the same clock as the jne.
>(The cmp+jne might also get merged into a single u-op)
>(I've done this with adc (for IP checksum), with two adc the loop takes
>two clocks even with the extra memory reads.)
>
>So that loop is likely to run limited by the three clock latency of crc32.
>Even the memory reads will happen with all the crc32 just waiting for the
>previous crc32 to finish.
>You can take an instruction out of the loop:
>1: crc32q (%rcx,%rdx), %rbx
> addq $8, %rdx
> jne 1b
>but that may not be necessary, and (IIRC) gcc doesn't like letting you
>generate it.
>
>For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>a byte depends on how far it is from the end of the buffer, and that initial
>zero bytes have no effect.
>So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>right by the number of bytes needed to make the rest of the buffer a multiple
>or 8 bytes (the same as reading from across the start of the buffer and masking
>the low bytes) then treat exactly the same as a buffer that is a multiple
>of 8 bytes long.
>Don't worry about misaligned reads, you lose less than one clock per cache
>line (that is with adc doing a read every clock).
>
>Actually measuring the performance is hard.
>You can use rdtsc because the clock speed will change when the cpu gets busy.
>There is a 'performance counter' that is actual clocks.
>While you can use the library functions to set it up, you need to just read the
>register - the library overhead it too big.
>You also need the odd lfence.
>Having done that, and provided the buffer is in the L1 d-cache you can measure
>the loop time in clocks and compare against the expected value.
>Once you've got 3 clocks per crc32 instruction it won't get any better,
>which is why the 'fast' code for big buffers does crc of 3+ buffers sections
>in parallel.
>
> David
>
>>
>> -bw
>>
>> .LBB1_9: # =>This Inner Loop Header: Depth=1
>> movl %ebx, %ebx
>> crc32q (%rcx), %rbx
>> addq $8, %rcx
>> incq %rdi
>> cmpq %rdi, %rsi
>> jne .LBB1_9
>> # %bb.10:
>> subq %rdi, %rax
>> jmp .LBB1_11
>> .LBB1_7:
>> movq %r14, %rcx
>> .LBB1_11:
>> movq %r15, %rsi
>> andq $-8, %rsi
>> cmpq $7, %rdx
>> jb .LBB1_14
>> # %bb.12:
>> xorl %edx, %edx
>> .LBB1_13: # =>This Inner Loop Header: Depth=1
>> movl %ebx, %ebx
>> crc32q (%rcx,%rdx,8), %rbx
>> crc32q 8(%rcx,%rdx,8), %rbx
>> crc32q 16(%rcx,%rdx,8), %rbx
>> crc32q 24(%rcx,%rdx,8), %rbx
>> crc32q 32(%rcx,%rdx,8), %rbx
>> crc32q 40(%rcx,%rdx,8), %rbx
>> crc32q 48(%rcx,%rdx,8), %rbx
>> crc32q 56(%rcx,%rdx,8), %rbx
>> addq $8, %rdx
>> cmpq %rdx, %rax
>> jne .LBB1_13
>> .LBB1_14:
>> addq %rsi, %r14
>> .LBB1_15:
>> andq $7, %r15
>> je .LBB1_23
>> # %bb.16:
>> crc32b (%r14), %ebx
>> cmpl $1, %r15d
>> je .LBB1_23
>> # %bb.17:
>> crc32b 1(%r14), %ebx
>> cmpl $2, %r15d
>> je .LBB1_23
>> # %bb.18:
>> crc32b 2(%r14), %ebx
>> cmpl $3, %r15d
>> je .LBB1_23
>> # %bb.19:
>> crc32b 3(%r14), %ebx
>> cmpl $4, %r15d
>> je .LBB1_23
>> # %bb.20:
>> crc32b 4(%r14), %ebx
>> cmpl $5, %r15d
>> je .LBB1_23
>> # %bb.21:
>> crc32b 5(%r14), %ebx
>> cmpl $6, %r15d
>> je .LBB1_23
>> # %bb.22:
>> crc32b 6(%r14), %ebx
>> .LBB1_23:
>> movl %ebx, %eax
>> .LBB1_24:
>
>

The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?