Re: [PATCH v1 1/1] lib/crypto: tests: Add KUnit tests for AES

From: Holger Dengler

Date: Fri Jan 16 2026 - 12:32:11 EST

Hi David,

On 15/01/2026 23:05, David Laight wrote:
> On Thu, 15 Jan 2026 12:43:32 -0800
> Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>>> +static void benchmark_aes(struct kunit *test, const struct aes_testvector *tv)
>>> +{
>>> + const size_t num_iters = 10000000;
>>
>> 10000000 iterations is too many. That's 160 MB of data in each
>> direction per AES key length. Some CPUs without AES instructions can do
>> only ~20 MB AES per second. In that case, this benchmark would take 16
>> seconds to run per AES key length, for 48 seconds total.
>
> Probably best to first do a test that would take a 'reasonable' time
> on a cpu without AES. If that is 'very fast' then do a longer test
> to get more accuracy on a faster implementation.
>
>>
>> hash-test-template.h and crc_kunit.c use 10000000 / (len + 128)
>> iterations. That would be 69444 in this case (considering len=16),
>> which is less than 1% of the iterations you've used. Choosing a number
>> similar to that would seem more appropriate.
>>
>> Ultimately these are just made-up numbers. But I think we should aim
>> for the benchmark test in each KUnit test suite to take less than a
>> second or so. The existing tests roughly achieve that, whereas it seems
>> this one can go over it by quite a bit due to the 10000000 iterations.
>
> Even 1 second is a long time, you end up getting multiple interrupts included.
> I think a lot of these benchmarks are far too long.
> Timing differences less that 1% can be created by scheduling noise.
> Running a test that takes 200 'quanta' of the timer used has an
> error margin of under 1% (100 quanta might be enough).
> While the kernel timestamps have a resolution of 1ns the accuracy is worse.
> If you run a test for even just 10us you ought to get reasonable accuracy
> with a reasonable hope of not getting an interrupt.
> Run the test 10 times and report the fastest value.
>
> You'll then find the results are entirely unstable because the cpu clock
> frequency keeps changing.
> And long enough buffers can get limited by the d-cache loads.
>
> For something as slow as AES you can count the number of cpu cycles for
> a single call and get a reasonably consistent figure.
> That will tell you whether the loop is running at the speed you might
> expect it to run at.
> (You need to use data dependencies between the start/end 'times' and
> start/end of the code being timed, x86 lfence/mfence are too slow and
> can hide the 'setup' cost of some instructions.)

Thanks a lot for your feedback. I tried a few of your ideas, and it turns out
that they work quite well. First of all, with a single-block AES
encrypt/decrypt in our hardware (CPACF), we're very close to the resolution of
our CPU clock.

Disclaimer: The encryption/decryption of one block takes ~32ns (~500MB/s).
These numbers should be taken with some care, as on s390 the operating system
always runs virtualized. In my test environment, I also only have access to a
machine with shared CPUs, so there might be some negative impact from other
workload.

The benchmark now loops for 100 iterations without any warm-up. In each
iteration, I measure a single aes_encrypt()/aes_decrypt() call. The lowest
value of these measurements is taken as the value for the bandwidth
calculations. Although it is not necessary in my environment, I'm doing all
iterations with preemption disabled. I think that this might help on other
platforms to reduce the jitter of the measurement values.
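
For illustration only, here is a rough userspace sketch of that loop. It is
not the code from the patch: bench_min_ns(), mb_per_sec(), dummy_work() and
the use of clock_gettime() are assumptions made for this example, and
userspace has no equivalent of disabling preemption, so that part is only
noted in a comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for one aes_encrypt()/aes_decrypt() call. */
typedef void (*bench_fn)(void *ctx);

/* Trivial workload so the sketch can be exercised without AES code. */
static void dummy_work(void *ctx)
{
	volatile unsigned int *counter = ctx;

	(*counter)++;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Time every call separately and keep only the fastest sample.
 * In the kernel this loop would run with preemption disabled;
 * this userspace sketch cannot do that.
 */
static uint64_t bench_min_ns(bench_fn fn, void *ctx, unsigned int iters)
{
	uint64_t best = UINT64_MAX;
	uint64_t t0, dt;
	unsigned int i;

	assert(iters > 0);
	for (i = 0; i < iters; i++) {
		t0 = now_ns();
		fn(ctx);
		dt = now_ns() - t0;
		if (dt < best)
			best = dt;
	}
	return best;
}

/* 1 byte/ns equals 1000 MB/s (with 1 MB = 10^6 bytes). */
static uint64_t mb_per_sec(uint64_t bytes, uint64_t ns)
{
	return ns ? bytes * 1000 / ns : 0;
}
```

With the numbers from the disclaimer above, one 16-byte block in 32ns gives
mb_per_sec(16, 32) == 500.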

The removal of the warm-up does not have any impact on the numbers.

Just for information: I also tried to measure the cycles, with the same
results. The minimal measurement value of a few iterations is much more stable
than the average over a larger number of iterations.

I also did some tests with IRQs disabled (instead of only preemption), but the
numbers stay the same. So I think it is safe enough to stay with disabled
preemption.

I also tried your idea of first doing a few measurements and, if they are fast
enough, increasing the number of iterations. But it turns out that this is not
really necessary (at least in my env). But I can add this if it makes sense
on other platforms.
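
If it turns out to be useful, the scaling step could look roughly like the
sketch below. This is only an illustration of the probe-then-scale idea;
scale_iters() and all of its parameters are hypothetical and not part of the
patch.

```c
#include <stdint.h>

/*
 * Pick the iteration count for the real run from a short probe run:
 * scale it so the run fits into budget_ns, but never go below the
 * probe count or above max_iters. The multiplication is assumed not
 * to overflow for the small probe counts and budgets used here.
 */
static uint64_t scale_iters(uint64_t probe_iters, uint64_t probe_ns,
			    uint64_t budget_ns, uint64_t max_iters)
{
	uint64_t iters;

	/* Probe was faster than the clock resolution. */
	if (probe_ns == 0)
		return max_iters;

	iters = probe_iters * budget_ns / probe_ns;
	if (iters < probe_iters)
		iters = probe_iters;
	if (iters > max_iters)
		iters = max_iters;
	return iters;
}
```

For example, a probe of 100 blocks at ~32ns each (3200ns in total) with a
1ms budget scales to 31250 iterations, while a slow implementation that
already needs more than the budget for the probe stays at 100.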

--
Mit freundlichen Grüßen / Kind regards
Holger Dengler
--
IBM Systems, Linux on IBM Z Development
dengler@xxxxxxxxxxxxx