RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
From: Elliott, Robert (Servers)
Date: Fri Dec 02 2022 - 01:23:45 EST
> -----Original Message-----
> From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Sent: Friday, November 25, 2022 2:41 AM
> To: Jason A. Donenfeld <Jason@xxxxxxxxx>
> Cc: Elliott, Robert (Servers) <elliott@xxxxxxx>; davem@xxxxxxxxxxxxx;
> tim.c.chen@xxxxxxxxxxxxxxx; ap420073@xxxxxxxxx; ardb@xxxxxxxxxx;
> David.Laight@xxxxxxxxxx; ebiggers@xxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
>
> On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> > On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > > +
> >
> > Use an enum for constants like this:
> >
> > enum { BYTES_PER_FPU = ... };
> >
> > You can even make it function-local, so it's near the code that uses it,
> > which will better justify its existence.
> >
> > Also, where did you get this number? Seems kind of weird.
>
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
>
> Perhaps we should try a different approach. How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
>
> 	if (!len)
> 		return 0;
>
> 	kernel_fpu_begin();
> 	for (;;) {
> 		unsigned int chunk = min(len, 4096);
>
> 		sha1_base_do_update(desc, data, chunk, sha1_xform);
>
> 		len -= chunk;
> 		data += chunk;
>
> 		if (!len)
> 			break;
>
> 		if (need_resched()) {
> 			kernel_fpu_end();
> 			cond_resched();
> 			kernel_fpu_begin();
> 		}
> 	}
> 	kernel_fpu_end();
>
I implemented that conditional approach in the sha algorithms.
The results of a boot (using sha512 for module signatures, with
crypto extra tests enabled, comparing to sha512 with a 20 KiB
fixed limit) are:
sha1   cond:    14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
sha256 cond:    26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
sha512 cond:    26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
sha512 20KiB:  161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles
NOTE: I didn't have a patch in place to isolate the counts for each
variation (ssse3 vs. avx vs. avx2) or for each algorithm pair
(sha512 vs. sha384, sha256 vs. sha224), so the sha256 and sha512
numbers include twice as many test runs as sha1.
This approach looks very good:
- 16% of the number of begin/end calls
- 10% of the CPU cycles spent making the calls
- the FPU context is held for a long time (up to 77 ms), but only
  while no other task needs the CPU
That's much more efficient than releasing it every 30 us just in case.
I'll keep testing this to make sure RCU stalls stay away, and apply
the approach to the other algorithms.
On x86, need_resched() reads a per-CPU variable, so I'm not sure
it's worth the hassle of figuring out how to do that check from
assembly code.