Re: [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
From: David Laight
Date: Mon Mar 30 2026 - 05:36:30 EST
On Sun, 29 Mar 2026 15:18:21 -0700
Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> On Sun, Mar 29, 2026 at 10:57:04PM +0100, David Laight wrote:
> > Final thought:
> > Is that allowing for the cost of kernel_fpu_begin()? - which I think only
> > affects the first call.
> > And the cost of the data-cache misses for the lookup table reads? - again
> > worse for the first call.
>
> I assume you mean kernel_neon_begin(). This is an arm64 patch.
Well, much the same.
> (I encourage you to actually read the code. You seem to send a lot of
> speculation-heavy comments without actually reading the code.)
I have looked at the code, since I (mostly) understand the maths I can
almost work out what it is doing - but all the conversions between three
different ways of holding two 64bit values in one 128bit register really
don't help.
> Currently, the benchmark in crc_kunit just measures the throughput in a
> loop (as has been discussed before). So no, it doesn't currently
> capture the overhead of pulling code and data into cache. For NEON
> register use it captures only the amortized overhead.
>
> Note that using PMULL saves having to pull the table into memory, while
> using the table is a bit less code and saves having to use kernel-mode
> NEON. So both have their advantages and disadvantages.
Indeed - so the 128 is really a 'finger in the air' value :-)
> This patch does fall back to the table for the last 'len & ~15' bytes,
> which means the table may be needed anyway.
Nibble lookups on two separate tables (256 bytes instead of 2k) might
be almost as fast even with the tables in the cache.
The critical part of the table lookup loop should be the:
crc = crc ^ table[crc & 0xff]
part (the rotate should get hidden in the memory read latency).
With nibble tables is becomes:
crc = crc ^ table_lo[crc & 0xf] ^ table_hi[(crc & 0xf0) >> 4]
on any modern cpu the table lookups will happen in parallel; so it
should just add one 'xor' to the loop.
(And yes, I probably could measure it, at least in userspace on x86-64.)
> That is not the optimal way
> to do it, and it's something to address later when this is replaced with
> something similar to x86's crc-pclmul-template.S.
That is one bit I do need to grok...
David
>
> - Eric