RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
From: Peter Lafreniere
Date: Tue Dec 06 2022 - 18:06:56 EST
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> >
> > if (!len)
> > 	return 0;
> >
> > kernel_fpu_begin();
> > for (;;) {
> > 	unsigned int chunk = min(len, 4096);
> >
> > 	sha1_base_do_update(desc, data, chunk, sha1_xform);
> >
> > 	len -= chunk;
> > 	data += chunk;
> >
> > 	if (!len)
> > 		break;
> >
> > 	if (need_resched()) {
> > 		kernel_fpu_end();
> > 		cond_resched();
> > 		kernel_fpu_begin();
> > 	}
> > }
> > kernel_fpu_end();
>
>
> I implemented that conditional approach in the sha algorithms.
>
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
>
> sha1 cond: 14479 calls; 784256 cycles doing begin/end; longest FPU context 35828 cycles
> sha256 cond: 26763 calls; 1273570 cycles doing begin/end; longest FPU context 118612 cycles
> sha512 cond: 26957 calls; 1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context 4049644 cycles
>
> NOTE: I didn't have a patch in place to isolate the counts for each variation
> (ssse3 vs. avx vs. avx2) and
> - for sha512: sha512 vs. sha384
> - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
>
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context can be held for a long time (77 ms), but only
> while a reschedule isn't needed.
>
> That's much more efficient than releasing it every 30 us just in case.
How recently did you make this change? I implemented this conditional
approach for ecb_cbc_helpers.h, but saw no changes at all to performance
on serpent-avx2 and twofish-avx.
kernel_fpu_{begin,end}() (after the first call to begin) do little
more than disable/enable preemption and make a few writes to the MXCSR.
The approach above likely has slightly less overhead, and it will
preempt on non-CONFIG_PREEMPT kernels, but nothing suggests a
performance uplift.
This brings us back to this question: should crypto routines be
preempted under PREEMPT_VOLUNTARY or not?
> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.
I missed the earlier discussions. Have you seen issues with RCU
stalls/latency spikes because of crypto routines? If so, what preemption
model were you running?
> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.
Leave it in C. It'll be more maintainable that way.
Cheers,
Peter Lafreniere <peter@xxxxxxxx>