Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
From: Jason A. Donenfeld
Date: Mon Jun 18 2018 - 11:26:11 EST
On Mon, Jun 18, 2018 at 11:44 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > Didn't we recently do a bunch of crypto patches to help with this?
> > >
> > > I think they had the pattern:
> > >
> > > kernel_fpu_begin();
> > > for (units-of-work) {
> > > do_unit_of_work();
> > > if (need_resched()) {
> > > kernel_fpu_end();
> > > cond_resched();
> > > kernel_fpu_begin();
> > > }
> > > }
> > > kernel_fpu_end();
> >
> > Right, so that's the thing -- this is an optimization easily available
> > to individual crypto primitives. But I'm interested in applying this
> > kind of optimization to an entire queue of, say, tiny packets, where
> > each packet is processed individually. Or, to a cryptographic
> > construction, where several different primitives are used, such that
> > it'd be meaningful not to have to get the performance hit of
> > end()begin() in between each and everyone of them.
>
> I'm confused.. how does the above not apply to your situation?
In the example you've given, the optimization is applied at the level
of the, say, encryption function. Suppose you send a scattergather off
to an encryption function, which then walks the sglist and encrypts
each of the parts using some particular key. For each of the parts, it
benefits from the above optimization.
But what I'm referring to is encrypting multiple different things,
with different keys. In the case I'd like to optimize, I have a worker
thread that's processing a large queue of separate sglists and
encrypting them separately under different keys. In this case, having
kernel_fpu_begin/end inside the encryption function itself is a
problem, since that means toggling the FPU in between every queue
item. The solution, for now, is to just hoist the kernel_fpu_begin/end
out of the encryption function, and put them instead at the beginning
and end of my worker thread that handles all the items of the queue.
This is fine and dandy, but far from ideal, as putting that kind of
logic inside the encryption function itself makes more sense. For
example, the encryption function can decide whether or not it even
wants to use the FPU before calling kernel_fpu_begin. Ostensibly this
logic too could be hoisted outside, but at what point do you draw the
line and decide these optimizations are leading the API in the wrong
direction?
Hence, the idea here in this thread is to make it cost-free to place
kernel_fpu_begin/end as close as possible to the actual use of the
FPU. The solution, it seems, is to have the actual kernel_fpu_end work
occur on context switch.