Re: [PATCH] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation

From: Demian Shulhan

Date: Sun Mar 22 2026 - 05:30:07 EST

Hi Eric, David,

Thanks to both of you for the review and suggestions! I've addressed
all your comments and will send the v2 patch shortly.

The idea of a unified PMULL template for ARM64 is very interesting. I'm
happy to take it on, but since it requires careful design (parallel
folding across multiple vectors, handling LSB/MSB differences, and
generalizing Barrett reduction), it will take some time to implement
and test properly.

Do you think it makes sense to merge the current solution (with the
review comments addressed) for now, and have me follow up with the
general template implementation in a separate patchset later?

Thanks,
Demian

On Fri, Mar 20, 2026 at 22:00, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> On Fri, Mar 20, 2026 at 10:36:24AM +0000, David Laight wrote:
> > I'm also pretty sure that the same loop will process 32bit and 16bit CRC
> > (just needs the high bits of the constant multiplier set to zero).
> > There are fewer bits to correct for at the end (I think it is always
> > the size of the CRC) but that may not be worth worrying about.
>
> Again, see lib/crc/x86/ and lib/crc/riscv/ which do basically this.
>
> > It might be better to write some C that required the architecture provide
> > the functions required for doing a CRC with 128bit registers that hold
> > two 64bit values (etc) and give them sane names.
> >
> > Then common C code can be used provided the required instructions exist.
>
> While it would be great to share more CRC code between architectures by
> using a C "template" combined with some arch-dependent inline asm
> blocks, there's actually a lot of variation in what instructions and
> register widths the different architectures have.
>
> lib/crc/riscv/crc-clmul-template.h actually has something very similar
> to this already: it's written in C, and there are just three
> single-instruction inline asm blocks to access RISC-V's clmul
> instructions. Unfortunately, the carryless multiplication instructions
> on the other architectures are not compatible with these. So, it's hard
> to make it anything more than RISC-V specific code.
>
> There might be enough similarity between arm, arm64, and x86_64 for them
> to share code using a similar "template". However, consider that for
> x86_64 we need to support different register widths. See
> lib/crc/x86/crc-pclmul-template.S.
>
> > I'm pretty sure the loop is effectively:
> > for (; p < limit; p++)
> > p[N] ^= low(*p) * const_a ^ high(*p) * const_b;
> > where N is at least one and you don't actually want to write into the buffer.
> > Making N > 1 should improve performance - just needs care.
>
> Well, you're welcome to read the actual code and not just speculate.
>
> But again, maybe best to not get too sidetracked for now, unless you or
> Demian are actually planning to work on the more general version.
>
> - Eric
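
For reference, the folding loop David describes above can be modeled in
portable C. This is only a scaled-down sketch under stated assumptions,
not the code from the patch: it uses the CRC-32 generator in MSB-first
bit order with no init or final XOR, 64-bit blocks with 32-bit halves
stand in for 128-bit registers with 64-bit halves, and a software
clmul32() stands in for PMULL. All function names are hypothetical. The
fold distance is N = 1 block, and the accumulator lives in a local
variable rather than being written back into the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY_LOW 0x04C11DB7u	/* CRC-32 generator P(x) without the x^32 term */

/* Carry-less (GF(2)) multiply of two 32-bit polynomials; degree <= 62. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 32; i++)
		if ((b >> i) & 1)
			r ^= (uint64_t)a << i;
	return r;
}

/* Returns (r * x + bit) mod P. */
static uint32_t shift_in(uint32_t r, unsigned int bit)
{
	uint32_t top = r >> 31;

	r = (r << 1) | bit;
	return top ? r ^ POLY_LOW : r;
}

/* Returns x^k mod P; used to derive the two fold constants. */
static uint32_t xpow_mod(unsigned int k)
{
	uint32_t r = 1;

	while (k--)
		r = shift_in(r, 0);
	return r;
}

/* Reference: M(x) mod P one bit at a time, most significant bit first. */
static uint32_t mod_bitwise(const uint64_t *blk, size_t n)
{
	uint32_t r = 0;

	for (size_t i = 0; i < n; i++)
		for (int b = 63; b >= 0; b--)
			r = shift_in(r, (unsigned int)((blk[i] >> b) & 1));
	return r;
}

/*
 * Folding: moving a block forward by 64 bits multiplies it by x^64, so
 * (hi * x^32 + lo) * x^64 == hi * (x^96 mod P) + lo * (x^64 mod P) (mod P).
 * Both products fit in 64 bits and are XORed into the next block.
 */
static uint32_t mod_folded(const uint64_t *blk, size_t n)
{
	const uint32_t k_hi = xpow_mod(96);
	const uint32_t k_lo = xpow_mod(64);
	uint64_t acc;

	if (n == 0)
		return 0;
	acc = blk[0];
	for (size_t i = 1; i < n; i++)
		acc = blk[i] ^ clmul32((uint32_t)(acc >> 32), k_hi) ^
		      clmul32((uint32_t)acc, k_lo);
	/* Only the last 64 bits still need a real reduction. */
	return mod_bitwise(&acc, 1);
}
```

Folding over a distance of N blocks would only change the constants to
x^(64N+32) mod P and x^(64N) mod P and require N independent
accumulators. A reflected (LSB-first) CRC such as CRC64-NVMe needs
bit-reversed constants and a different final reduction, which this
sketch does not cover.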