Re: [PATCH] x86/crc32: optimize tail handling for crc32c short inputs
From: David Laight
Date: Wed Mar 05 2025 - 09:27:42 EST
On Tue, 4 Mar 2025 13:32:16 -0800
Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> From: Eric Biggers <ebiggers@xxxxxxxxxx>
>
> For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
> do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows
> taking advantage of wider CRC instructions. Note that crc32c-3way.S
> already uses this same optimization too.
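For reference, the 4-2-1 step-down described in the quoted text can be sketched in portable C roughly as below. This is an illustration only, not the code from the patch: crc32c_step() is a hypothetical software stand-in for the 1/2/4-byte crc32 instructions, load_le() and the other function names are made up here, and the tail is assumed to be shorter than 8 bytes (sizeof(unsigned long) on x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u	/* CRC32C (Castagnoli), bit-reflected */

/* Hypothetical stand-in for the crc32 instructions: fold the low nbytes
 * (1, 2 or 4) of 'data' into 'crc', least significant byte first. */
static uint32_t crc32c_step(uint32_t crc, uint32_t data, unsigned nbytes)
{
	crc ^= data;
	for (unsigned i = 0; i < 8 * nbytes; i++)
		crc = (crc >> 1) ^ ((crc & 1) ? CRC32C_POLY : 0);
	return crc;
}

/* Assemble n (<= 4) bytes as a little-endian value on any host. */
static uint32_t load_le(const uint8_t *p, unsigned n)
{
	uint32_t v = 0;
	for (unsigned i = 0; i < n; i++)
		v |= (uint32_t)p[i] << (8 * i);
	return v;
}

/* Reference tail handling: one byte per step. */
static uint32_t crc32c_tail_bytewise(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = crc32c_step(crc, *p++, 1);
	return crc;
}

/* 4-2-1 step-down for a tail of len < 8 bytes: at most three steps. */
static uint32_t crc32c_tail_421(uint32_t crc, const uint8_t *p, size_t len)
{
	if (len & 4) {
		crc = crc32c_step(crc, load_le(p, 4), 4);
		p += 4;
	}
	if (len & 2) {
		crc = crc32c_step(crc, load_le(p, 2), 2);
		p += 2;
	}
	if (len & 1)
		crc = crc32c_step(crc, *p, 1);
	return crc;
}

/* Self-check: both tail handlers agree for every length 0..7. */
static void check_tail_421(void)
{
	static const uint8_t buf[7] = { 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37 };

	for (size_t len = 0; len <= 7; len++)
		assert(crc32c_tail_421(0xDEADBEEFu, buf, len) ==
		       crc32c_tail_bytewise(0xDEADBEEFu, buf, len));
}
```

A 7-byte tail takes three steps here instead of seven. Whether that shows up as a measurable gain on real hardware is a separate question that this sketch does not answer.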
An alternative is to add extra zero bytes at the start of the buffer.
Leading zeros don't affect the crc while the running crc is zero, so all
that is needed is to shift the first 8 bytes left.
I think any non-zero 'crc-in' then just needs to be xor'ed over the first
4 actual data bytes.
(It's over 40 years since I did the maths of CRC.)
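A minimal sketch of the leading-zero idea, in portable C and untested against the real crc32 instructions. Everything named here is hypothetical: crc32c_step8() stands in for a 64-bit crc32 step, load_le64() stands in for what would really be a single (possibly misaligned) load plus a shift, and crc32c_front_padded() is an invented name. The 'spill' handling for a partial first word of fewer than 4 bytes is an assumption derived from the CRC algebra, not something stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u	/* CRC32C (Castagnoli), bit-reflected */

/* Hypothetical stand-in for a 64-bit crc32 step: fold 8 bytes into 'crc'. */
static uint32_t crc32c_step8(uint32_t crc, uint64_t data)
{
	uint64_t r = data ^ crc;

	for (unsigned i = 0; i < 64; i++)
		r = (r >> 1) ^ ((r & 1) ? CRC32C_POLY : 0);
	return (uint32_t)r;
}

/* Assemble n (<= 8) bytes as a little-endian value on any host. */
static uint64_t load_le64(const uint8_t *p, size_t n)
{
	uint64_t v = 0;

	for (size_t i = 0; i < n; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* Reference: plain byte-at-a-time CRC32C update, no inversions. */
static uint32_t crc32c_bytewise(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32C_POLY : 0);
	}
	return crc;
}

/* Pad the front with zero bytes so that only 8-byte steps are needed. */
static uint32_t crc32c_front_padded(uint32_t crc, const uint8_t *p, size_t len)
{
	size_t head = len & 7;	/* data bytes in the partial first word */
	uint32_t state = 0;	/* zero state: leading zero bytes are a no-op */
	uint64_t spill = crc;	/* crc-in bits not yet xor'ed over the data */

	if (head) {
		/* xor crc-in over the first data bytes, then shift left so
		 * that 8 - head zero bytes come first */
		uint64_t w = (load_le64(p, head) ^ crc) << (8 * (8 - head));

		state = crc32c_step8(0, w);
		spill = (uint64_t)crc >> (8 * head);	/* non-zero only if head < 4 */
		p += head;
		len -= head;
	}
	state ^= (uint32_t)spill;	/* same as xor'ing over the next data bytes */
	for (; len; len -= 8, p += 8)
		state = crc32c_step8(state, load_le64(p, 8));
	return state;
}

/* Self-check against the bytewise reference for lengths 0..20. */
static void check_front_padded(void)
{
	static const uint8_t pad[8] = { 0, 0, 0, 0x31, 0x32, 0x33, 0x34, 0x35 };
	uint8_t buf[20];

	for (size_t i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(0x31 + i);
	/* leading zero bytes change nothing while the crc state is zero */
	assert(crc32c_bytewise(0, pad, 8) == crc32c_bytewise(0, pad + 3, 5));
	for (size_t len = 0; len <= sizeof(buf); len++)
		assert(crc32c_front_padded(0xDEADBEEFu, buf, len) ==
		       crc32c_bytewise(0xDEADBEEFu, buf, len));
}
```

With the padding at the front, every step after the first is a full 8-byte one, at the price of the later loads being misaligned relative to the start of the buffer.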
You won't notice the misaligned accesses all down the buffer.
When I was testing different ipcsum code, misaligned buffers
cost less than 1 clock per cache line.
I think that was even true for the versions that managed 12 bytes
per clock (including the one Linus committed).
David