Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

From: Andy Lutomirski
Date: Tue Mar 20 2018 - 10:57:36 EST


On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().
>
> kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full FPU
> save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily copy a
> thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter for
> something small, like a single 256-bit or 512-bit word access.
>
> But there's actually a new thing in modern kernels: we got rid of (most of) lazy
> save/restore FPU code, our new x86 FPU model is very "direct" with no FPU faults
> taken normally.
>
> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
> vmovdqa %ymm0, 40(%rsp)
> vmovdqa %ymm1, 80(%rsp)
>
> ...
> # use ymm0 and ymm1
> ...
>
> vmovdqa 80(%rsp), %ymm1
> vmovdqa 40(%rsp), %ymm0
>

I think this kinda sorts works, but only kinda sorta:

- I'm a bit worried about newer CPUs where, say, a 256-bit vector
operation will implicitly clear the high 768 bits of the regs. (IIRC
that's how it works for the new VEX stuff.)

- This code will cause XINUSE to be set, which is user-visible and
may have performance and correctness effects. I think the kernel may
already have some but where it occasionally sets XINUSE on its own,
and this caused problems for rr in the past. This might not be a
total showstopper, but it's odd.

I'd rather see us finally finish the work that Rik started to rework
this differently. I'd like kernel_fpu_begin() to look like:

if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
return; // we're already okay. maybe we need to check
in_interrupt() or something, though?
} else {
XSAVES/XSAVEOPT/XSAVE;
set_thread_flag(TIF_NEED_FPU_RESTORE):
}

and kernel_fpu_end() does nothing at all.

We take the full performance hit for a *single* kernel_fpu_begin() on
an otherwise short syscall or interrupt, but there's no additional
cost for more of them or for long-enough-running things that we
schedule in the middle.

As I remember, the main hangup was that this interacts a bit oddly
with PKRU, but that's manageable.

--Andy