Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()

From: Eric Biggers

Date: Fri Jun 12 2026 - 02:01:12 EST

On Fri, Jun 12, 2026 at 07:22:47AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.
>
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):
>
> Can you share the benchmark?

For now I just hacked up do_xor_speed() as follows and changed
xor_force() to xor_register(). A proper benchmark should be added to the
KUnit test, though, similar to the ones in the crypto and CRC tests.

diff --git a/lib/raid/xor/xor-core.c b/lib/raid/xor/xor-core.c
index bd4e6e434418..8c5814af03d5 100644
--- a/lib/raid/xor/xor-core.c
+++ b/lib/raid/xor/xor-core.c
@@ -76,15 +76,24 @@ void __init xor_force(struct xor_block_template *tmpl)
 #define REPS 800U

 static void __init
-do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
+do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2,
+	     void *b3, void *b4, void *b5)
 {
+	for (int src_cnt = 1; src_cnt <= 4; src_cnt++) {
 	int speed;
 	unsigned long reps;
 	ktime_t min, start, t0;
-	void *srcs[1] = { b2 };
+	void *srcs[4] = { b2, b3, b4, b5 };

 	preempt_disable();

+	/* warm-up */
+	for (int i = 0; i < 8000; i++) {
+		mb(); /* prevent loop optimization */
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
+		mb();
+	}
+
 	reps = 0;
 	t0 = ktime_get();
 	/* delay start until time has advanced */
@@ -92,7 +101,7 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 		cpu_relax();
 	do {
 		mb(); /* prevent loop optimization */
-		tmpl->xor_gen(b1, srcs, 1, BENCH_SIZE);
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
 		mb();
 	} while (reps++ < REPS || (t0 = ktime_get()) == start);
 	min = ktime_sub(t0, start);
@@ -105,26 +114,30 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)

 	pr_info(" %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
+}

 static int __init calibrate_xor_blocks(void)
 {
-	void *b1, *b2;
+	void *b1, *b2, *b3, *b4, *b5;
 	struct xor_block_template *f, *fastest;

 	if (forced_template)
 		return 0;

-	b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+	b1 = (void *) __get_free_pages(GFP_KERNEL, 4);
 	if (!b1) {
 		pr_warn("xor: Yikes! No memory available.\n");
 		return -ENOMEM;
 	}
 	b2 = b1 + 2*PAGE_SIZE + BENCH_SIZE;
+	b3 = b2 + 2*PAGE_SIZE + BENCH_SIZE;
+	b4 = b3 + 2*PAGE_SIZE + BENCH_SIZE;
+	b5 = b4 + 2*PAGE_SIZE + BENCH_SIZE;

 	pr_info("xor: measuring software checksum speed\n");
 	fastest = template_list;
 	for (f = template_list; f; f = f->next) {
-		do_xor_speed(f, b1, b2);
+		do_xor_speed(f, b1, b2, b3, b4, b5);
 		if (f->speed > fastest->speed)
 			fastest = f;
 	}
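
The allocation bump from order 2 to order 4 in the hack above can be checked with a little arithmetic. The sketch below is not part of the patch; it assumes PAGE_SIZE and BENCH_SIZE are both 4096 bytes (neither value is stated in this mail), and the SKETCH_* macros and helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; neither is given in the mail. */
#define SKETCH_PAGE_SIZE  4096UL
#define SKETCH_BENCH_SIZE 4096UL

/* Byte offset of buffer n (0 = b1, 1 = b2, ...) from the start of the
 * allocation, following b[n+1] = b[n] + 2*PAGE_SIZE + BENCH_SIZE. */
static size_t buf_offset(unsigned int n)
{
	return n * (2 * SKETCH_PAGE_SIZE + SKETCH_BENCH_SIZE);
}

/* Size in bytes of an order-N page allocation (2^N pages). */
static size_t alloc_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Returns 1 if nbufs buffers of BENCH_SIZE bytes each fit inside an
 * order-N allocation with the spacing above, 0 otherwise. */
static int layout_fits(unsigned int nbufs, unsigned int order)
{
	return buf_offset(nbufs - 1) + SKETCH_BENCH_SIZE <= alloc_bytes(order);
}
```

Under those assumptions, layout_fits(2, 2) holds exactly (the second buffer ends at byte 16384), layout_fits(5, 2) fails, and layout_fits(5, 4) holds with the fifth buffer ending at byte 53248 of 65536.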

> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance. On my laptop with an AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks. I've been
> holding them back as I don't have a good kernel benchmarking harness,
> and it's missing the quirks for old AVX512 or the newer AMD special
> cases.
>
> Attached for reference.
>
> Note that either way I'd prefer if we could get away from the strange
> old code organization with the DO{1-4} helpers which don't really
> help.

Well, doing the same on your avx512bw version and adding a column to my
table for it (by the way, I think it really just needs avx512f), I get:

src_cnt     avx          avx512       avx512bw
=======     ==========   ==========   ==========
1           68423 MB/s   81940 MB/s   12067 MB/s
2           56035 MB/s   74112 MB/s   10958 MB/s
3           49396 MB/s   67011 MB/s    8608 MB/s
4           43056 MB/s   60823 MB/s    8069 MB/s

So, your version isn't great, I'm afraid. Making the inner loop be over
src_cnt does simplify the code a lot, but it destroys performance since
it turns into 9 instructions for each 64 bytes of every 2 source buffers:

  5b:   89 c1                   mov    %eax,%ecx
  5d:   8d 70 01                lea    0x1(%rax),%esi
  60:   48 8b 0c cb             mov    (%rbx,%rcx,8),%rcx
  64:   48 8b 34 f3             mov    (%rbx,%rsi,8),%rsi
  68:   62 f1 fd 48 6f 0c 11    vmovdqa64 (%rcx,%rdx,1),%zmm1
  6f:   62 f3 f5 48 25 04 16    vpternlogq $0x96,(%rsi,%rdx,1),%zmm1,%zmm0
  76:   96
  77:   83 c0 02                add    $0x2,%eax
  7a:   39 f8                   cmp    %edi,%eax
  7c:   72 dd                   jb     5b <xor_gen_avx512bw+0x4b>
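
For readers unfamiliar with the $0x96 immediate in the vpternlogq above: it is the 8-bit truth table of a three-way XOR, which is how one instruction folds two sources into the accumulator. A minimal portable C sketch that derives the constant and emulates a single 64-bit lane follows. The helper names are hypothetical and this is only an illustration, not code from either implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate one 64-bit lane of a ternary-logic operation: at each bit
 * position the three input bits form an index (a<<2 | b<<1 | c) that
 * selects one bit of the 8-bit truth table imm8. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
	uint64_t r = 0;

	for (int bit = 0; bit < 64; bit++) {
		unsigned int idx = (unsigned int)((((a >> bit) & 1) << 2) |
						  (((b >> bit) & 1) << 1) |
						  ((c >> bit) & 1));
		r |= (uint64_t)((imm8 >> idx) & 1) << bit;
	}
	return r;
}

/* Build the truth table for a ^ b ^ c by enumerating all 8 inputs. */
static uint8_t xor3_truth_table(void)
{
	uint8_t imm8 = 0;

	for (unsigned int idx = 0; idx < 8; idx++) {
		unsigned int a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;

		imm8 |= (uint8_t)((a ^ b ^ c) << idx);
	}
	return imm8;
}
```

xor3_truth_table() evaluates to 0x96, and ternlog64(a, b, c, 0x96) equals a ^ b ^ c for any inputs.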

You could try unrolling by 512 bytes, which should help.
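
One possible reading of that suggestion, sketched in portable C rather than AVX-512: keep 512 bytes of accumulator live and walk the sources once per 512-byte block, so each source pointer is loaded once per 512 bytes instead of once per 64. The function names are made up, uint64_t words stand in for zmm registers, and the semantics assumed here (dst is XORed with every source, length a multiple of 512) are not confirmed by this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 8	/* 64 bytes, one zmm register's worth */
#define BLOCK_WORDS 64	/* 512 bytes */

/* Shape of the slow loop: for every 64-byte chunk, walk all sources,
 * re-reading a source pointer from srcs[] for each chunk. */
static void xor_chunk_inner(uint64_t *dst, uint64_t *const *srcs,
			    unsigned int src_cnt, size_t bytes)
{
	for (size_t w = 0; w < bytes / 8; w += CHUNK_WORDS)
		for (unsigned int s = 0; s < src_cnt; s++)
			for (int i = 0; i < CHUNK_WORDS; i++)
				dst[w + i] ^= srcs[s][w + i];
}

/* Blocked variant: 512 bytes of accumulator per outer iteration, so the
 * per-source pointer load and loop overhead are amortized 8x.
 * bytes is assumed to be a multiple of 512. */
static void xor_block_512(uint64_t *dst, uint64_t *const *srcs,
			  unsigned int src_cnt, size_t bytes)
{
	for (size_t w = 0; w < bytes / 8; w += BLOCK_WORDS) {
		uint64_t acc[BLOCK_WORDS];

		memcpy(acc, dst + w, sizeof(acc));
		for (unsigned int s = 0; s < src_cnt; s++) {
			const uint64_t *p = srcs[s] + w;

			for (int i = 0; i < BLOCK_WORDS; i++)
				acc[i] ^= p[i];
		}
		memcpy(dst + w, acc, sizeof(acc));
	}
}
```

Both functions produce the same result; only the loop nesting differs. Whether a compiler or hand-written assembly turns the blocked shape into a real speedup would still have to be measured.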

- Eric