Re: [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation

From: David Laight

Date: Sun Mar 29 2026 - 17:57:18 EST


On Sun, 29 Mar 2026 13:38:29 -0700
Eric Biggers <ebiggers@xxxxxxxxxx> wrote:

> On Sun, Mar 29, 2026 at 07:43:38AM +0000, Demian Shulhan wrote:
> > Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
> > Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
> > software implementation is slow, which creates a bottleneck in NVMe and
> > other storage subsystems.
> >
> > The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
> > than raw assembly for better readability and maintainability.
> >
> > Key highlights of this implementation:
> > - Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
> > spikes on large buffers.
> > - Pre-calculates and loads fold constants via vld1q_u64() to minimize
> > register spilling.
> > - Benchmarks show the break-even point against the generic implementation
> > is around 128 bytes. The PMULL path is enabled only for len >= 128.

Final thought:
Is that allowing for the cost of kernel_fpu_begin()? - which I think only
affects the first call.
And the cost of the data-cache misses for the lookup table reads? - again
worse for the first call.

David

> >
> > Performance results (kunit crc_benchmark on Cortex-A72):
> > - Generic (len=4096): ~268 MB/s
> > - PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
> >
> > Signed-off-by: Demian Shulhan <demyansh@xxxxxxxxx>
>
> Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next
>
> Thanks!
>
> - Eric
>