Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
From: Ard Biesheuvel
Date: Mon Mar 30 2026 - 12:43:10 EST
Hi Demian,
On Sun, 29 Mar 2026, at 15:01, Demian Shulhan wrote:
> I want to address the comment about the marginal 0.3% speedup on the
> 8-disk benchmark. While throughput on a small array is indeed
> memory-bandwidth bound, that doesn't reveal the whole picture. I
> extracted the SVE and NEON implementations into a user-space benchmark
> to measure actual hardware efficiency using perf stat, running on the
> same AWS Graviton3 (Neoverse-V1) instance. The results show a massive
> difference in CPU efficiency. For the same 8-disk workload, the svex4
> implementation requires about 35% fewer instructions and 46% fewer CPU
> cycles compared to neonx4 (7.58 billion instructions vs 11.62
> billion). This translates directly into significant energy savings and
> reduced pressure on the CPU frontend, which would leave more compute
> resources available for network and NVMe queues during an array
> rebuild.
>
I think the results are impressive, but I'd like to better understand
their implications for real-world scenarios. Is this code only a
bottleneck when rebuilding an array? Is it really that much more power
efficient, given that the registers (and ALU paths) are twice the size?
And given the I/O load of rebuilding a 24+ disk array, how much CPU
throughput can we make use of meaningfully in such a scenario?
Supporting SVE in the kernel primarily impacts the size of the per-task
buffers needed to preserve and restore the vector context. Fortunately,
these are no longer allocated for the lifetime of the task, but
dynamically (by scoped_ksimd()), so the main impediment has recently
been removed. But as Mark pointed out, there are other things to
take into account. Nonetheless, our position has always been that a
compelling use case could convince us that the additional complexity
of in-kernel SVE is justified.
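For context, the operation both implementations vectorize is the
standard RAID6 Horner-rule evaluation over GF(2^8). Here is a minimal
scalar sketch (standalone, using the same 0x11d field polynomial as the
kernel's raid6 code; the function names are illustrative, not the
kernel's):

```c
#include <stdint.h>
#include <stddef.h>

/* GF(2^8) multiply-by-2 over the polynomial 0x11d, the field used by
 * the kernel's raid6 code. The NEON/SVE versions compute this with a
 * mask-and-XOR across whole vectors instead of one byte at a time. */
static uint8_t gf2_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * Scalar P/Q syndrome generation over 'ndata' data buffers of 'bytes'
 * each: P is plain XOR parity, Q accumulates D_z * g^z by Horner's
 * rule, walking from the highest data disk down to disk 0.
 */
static void gen_syndrome_scalar(int ndata, size_t bytes,
				uint8_t **dptr, uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < bytes; i++) {
		uint8_t wp = dptr[ndata - 1][i];
		uint8_t wq = wp;

		for (int z = ndata - 2; z >= 0; z--) {
			wp ^= dptr[z][i];
			wq = gf2_mul2(wq) ^ dptr[z][i];
		}
		p[i] = wp;
		q[i] = wq;
	}
}
```

The per-byte conditional in gf2_mul2() is why wide vectors help: each
lane handles it branch-free, and the Horner recurrence is sequential in
z but independent per byte, so throughput can scale with vector length.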
> Furthermore, as Christoph suggested, I tested scalability on wider
> arrays since the default kernel benchmark is hardcoded to 8 disks,
> which doesn't give the unrolled SVE loop enough data to shine. On a
> 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
> maintains a stable 15.0 GB/s — effectively doubling the throughput.
Does this mean the kernel benchmark is no longer fit for purpose? If
it cannot distinguish between implementations that differ in performance
by a factor of 2, I don't think we can rely on it to pick the optimal one.
> I agree this patch should be put on hold for now. My intention is to
> leave these numbers here as evidence that implementing SVE context
> preservation in the kernel (the "good use case") is highly justifiable
> from both a power-efficiency and a wide-array throughput perspective
> for modern ARM64 hardware.
>
Could you please summarize the results? The output below seems to have
become mangled a bit. Please also include the command line, a link to
the test source, and the vector length of the implementation.
> Thanks again for your time and review!
>
> ---------------------------------------------------
> User space test results:
> ==================================================
> RAID6 SVE Benchmark Results (AWS Graviton3)
> ==================================================
> Instance Details:
> Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
> 19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
> --------------------------------------------------
>
> [Test 1: Energy Efficiency / Instruction Count (8 disks)]
> Running baseline (neonx4)...
> algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36
>
> Performance counter stats for './raid6_bench neonx4 8 1000000':
>
> 11626717224 instructions # 1.67 insn per cycle
> 6946699489 cycles
> 257013219 L1-dcache-load-misses
>
> 2.681213149 seconds time elapsed
> 2.676771000 seconds user
> 0.002000000 seconds sys
>
> Running SVE (svex1)...
> algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23
>
> Performance counter stats for './raid6_bench svex1 8 1000000':
>
> 10527277490 instructions # 2.40 insn per cycle
> 4379539835 cycles
> 175695656 L1-dcache-load-misses
>
> 1.688852006 seconds time elapsed
> 1.687298000 seconds user
> 0.000999000 seconds sys
>
> Running SVE unrolled x4 (svex4)...
> algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04
>
> Performance counter stats for './raid6_bench svex4 8 1000000':
>
> 7587813392 instructions # 2.02 insn per cycle
> 3748486131 cycles
> 213816184 L1-dcache-load-misses
>
> 1.446032415 seconds time elapsed
> 1.442412000 seconds user
> 0.002996000 seconds sys
>
> ==================================================
> [Test 2: Scalability on Wide RAID Arrays (MB/s)]
> --- 16 Disks ---
> algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
> algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
> algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85
>
> --- 24 Disks ---
> algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
> algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
> algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92
>
> Extra tests:
> --- 48 Disks ---
> algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
> algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
> --- 96 Disks ---
> algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
> algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
> ==================================================
>