RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
From: David Laight
Date: Tue Apr 21 2020 - 04:18:05 EST
From: Jason A. Donenfeld
> Sent: 21 April 2020 05:15
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > Maybe kernel_fpu_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.
As you said yourself, with AVX512 it is much larger.
Which is why I suggested that the save code could allocate the area.
Note that the buffer would only be needed for nested use (for a full save).
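A minimal userspace sketch of that interface, to make the idea concrete. Every name here (model_fpu_begin, model_fpu_end, struct fpu_section, FPU_STATE_BYTES) is hypothetical and not an existing kernel API; calloc() stands in for whatever allocator the save path could really use, and the actual register save/restore is left as comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder size only; the real save area size depends on CPU features. */
#define FPU_STATE_BYTES 4096u

/* Hypothetical per-section bookkeeping. */
struct fpu_section {
    void **save_slot;           /* where a save buffer may be published */
    struct fpu_section *outer;  /* section that was interrupted, if any */
};

static struct fpu_section *current_section; /* stands in for per-cpu state */

/*
 * The caller passes the address of a pointer. Nothing is allocated for the
 * caller itself; a buffer is only allocated for an outer section that is
 * being nested inside, because that is the case needing a full save.
 */
static int model_fpu_begin(struct fpu_section *s, void **save_slot)
{
    struct fpu_section *outer = current_section;

    if (outer && *outer->save_slot == NULL) {
        *outer->save_slot = calloc(1, FPU_STATE_BYTES);
        if (*outer->save_slot == NULL)
            return -1;
        /* A real implementation would save the registers into it here. */
    }

    *save_slot = NULL;
    s->save_slot = save_slot;
    s->outer = outer;
    current_section = s;
    return 0;
}

static void model_fpu_end(struct fpu_section *s)
{
    assert(current_section == s);   /* sections must nest properly */
    current_section = s->outer;

    if (s->outer) {
        /* A real implementation would restore the outer state here. */
        free(*s->outer->save_slot);
        *s->outer->save_slot = NULL;
    }
}
```

In the non-nested case the slot stays NULL and no memory is touched, which is the point of letting the save code do the allocation.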
> > kernel_fpu_begin() ought also be passed a parameter saying which
> > fpu features are required, and return which are allocated.
> > On x86 this could be used to check for AVX512 (etc) which may be
> > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > section (etc).
> > It would also allow optimisations if only 1 or 2 fpu registers are
> > needed (eg for some of the crypto functions) rather than the whole
> > fpu register set.
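The request/grant idea quoted above could look roughly like the sketch below. It is only an illustration: the KFPU_* bits, struct cpu_model, model_fpu_begin_mask() and pick_impl() are all made-up names, and the 'busy' field is an assumed way of modelling features whose registers are live in an interrupted kernel_fpu_begin() section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, not taken from any kernel header. */
#define KFPU_SSE    (1u << 0)
#define KFPU_AVX    (1u << 1)
#define KFPU_AVX512 (1u << 2)

struct cpu_model {
    uint32_t supported; /* features the CPU implements */
    uint32_t busy;      /* features in use by an interrupted FPU section */
};

enum impl { IMPL_SCALAR, IMPL_SSE, IMPL_AVX, IMPL_AVX512 };

/* Returns the subset of 'wanted' that the caller is allowed to use. */
static uint32_t model_fpu_begin_mask(const struct cpu_model *cpu,
                                     uint32_t wanted)
{
    uint32_t granted = wanted & cpu->supported & ~cpu->busy;

    assert((granted & ~wanted) == 0); /* never grant more than requested */
    return granted;
}

/* A caller asks for everything it could use and falls back as needed. */
static enum impl pick_impl(const struct cpu_model *cpu)
{
    uint32_t got = model_fpu_begin_mask(cpu,
                                        KFPU_SSE | KFPU_AVX | KFPU_AVX512);

    if (got & KFPU_AVX512)
        return IMPL_AVX512;
    if (got & KFPU_AVX)
        return IMPL_AVX;
    if (got & KFPU_SSE)
        return IMPL_SSE;
    return IMPL_SCALAR;
}
```

A caller needing only one or two registers would pass a narrow mask, which is what would let the save code skip most of the register set.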
>
> For AVX512 this probably makes sense, I suppose. But I'm not sure if
> there are too many bits of crypto code that only use a few registers.
> There are those accelerated memcpy routines in i915 though -- ever see
> drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> this way, I wonder if it'd make sense to totally overengineer it and
> write a gcc/as plugin to create the register mask for us. Or, maybe
> some checker inside of objtool could help here.
I suspect some of that code is overly unrolled.
> Actually, though, the thing I've been wondering about is actually
> moving in the complete opposite direction: is there some
> efficient-enough way that we could allow FPU registers in all contexts
> always, without the need for kernel_fpu_begin/end? I was reversing
> ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> their judicious use of vectorisation everywhere. I assume a lot of
> that is being generated by their compiler, which of course gcc could
> do for us if we let it. Is that an interesting avenue to consider? Or
> are you pretty certain that it'd be a huge mistake, with an
> irreversible speed hit?
I think Windows takes the 'hit' of saving the entire fpu state on
every kernel entry.
Note that for system calls this is actually minimal.
All the 'caller saved' registers (most of the fpu ones) can be
trashed - ie reloaded with zeros.
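A toy model of the difference between the two entry paths, assuming a calling convention in which the vector registers may be clobbered by a call (as in the x86-64 System V ABI). The names are invented for illustration and only xmm0-xmm15 are modelled; this is not how any existing kernel entry code is written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NVEC      16    /* xmm0-xmm15 only; wider registers omitted */
#define VEC_BYTES 16

struct vec_regs {
    unsigned char r[NVEC][VEC_BYTES];
};

/*
 * Interrupt entry: the interrupted code expects every register to be
 * preserved, so the whole set has to be copied out.
 * Returns the number of bytes saved.
 */
static size_t enter_from_interrupt(const struct vec_regs *live,
                                   struct vec_regs *save)
{
    assert(live && save);
    memcpy(save, live, sizeof(*live));
    return sizeof(*live);
}

/*
 * System call entry: the caller already treats these registers as
 * clobbered by the call, so nothing is saved. They are zeroed so that
 * no kernel data is handed back in them.
 */
static size_t enter_from_syscall(struct vec_regs *live)
{
    assert(live);
    memset(live, 0, sizeof(*live));
    return 0;
}
```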
David