Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

From: Maciej W. Rozycki
Date: Sat Feb 21 2015 - 19:34:28 EST


On Sat, 21 Feb 2015, Borislav Petkov wrote:

> Provided I've not made a mistake, this leads me to think that this
> simple workload and pretty much everything else uses the FPU through
> glibc which does the SSE memcpy and so on. Which basically kills the
> whole idea behind lazy FPU as practically you don't really encounter
> workloads nowadays which don't use the FPU thanks to glibc and the lazy
> strategy doesn't really bring anything.
>
> Which would then mean, we don't really need the lazy handling as
> userspace is making it eager, so to speak, for us.

Please correct me if I'm wrong, but it looks to me like you're confusing
lazy FPU context allocation and lazy FPU context switching. These build
on the same hardware principles, but they are different concepts.

 Your "userspace is making it eager" statement in the context of glibc
using SSE for `memcpy' is certainly true for lazy FPU context allocation.
However, I wouldn't be so sure about lazy FPU context switching, and a
kernel compilation (or in fact any compilation) does not appear to be a
representative benchmark to me.  I am sure lots of software won't be
calling `memcpy' all the time, so there should be context switches between
which the FPU is not referred to at all.
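
 To illustrate the switching side of that distinction, here is a toy
user-space model, not kernel code.  It only counts how often an FP context
has to be moved (one save plus one restore) under each policy; the names
`sched_slot', `eager_moves' and `lazy_moves' are made up for this sketch,
and it assumes a single CPU and ignores the cost of the trap itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one scheduling slot. */
struct sched_slot {
    int task;      /* task that runs in this slot */
    int uses_fpu;  /* non-zero if the task touches the FPU in this slot */
};

/* Eager: the FP context is moved on every switch to a different task. */
static unsigned eager_moves(const struct sched_slot *s, size_t n)
{
    unsigned moves = 0;

    assert(s != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (s[i].task != s[i - 1].task)
            moves++;
    return moves;
}

/*
 * Lazy: a switch only marks the FPU as unavailable (the CR0.TS idea).
 * The context is moved on the first FPU use afterwards, and only if
 * the FPU still holds another task's state.
 */
static unsigned lazy_moves(const struct sched_slot *s, size_t n)
{
    unsigned moves = 0;
    int fpu_owner = -1;  /* task whose state is live in the FPU, if any */

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!s[i].uses_fpu)
            continue;            /* no FPU use: no trap, no move */
        if (fpu_owner != s[i].task) {
            if (fpu_owner != -1)
                moves++;         /* trap: save old owner, restore new */
            fpu_owner = s[i].task;
        }
    }
    return moves;
}
```

 With two tasks alternating over six slots, where task 0 uses the FPU in
two of its slots and task 1 only in its last one, this model counts five
context moves for the eager policy and one for the lazy policy.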

 Also, does `__builtin_memcpy' expand to SSE as well?  I'd expect GCC to
use it, rather than external `memcpy', for copying fixed amounts of data,
especially smaller ones, such as when passing structures by value in
function calls or for string operations like `strdup' or suchlike.  These
I'd expect to be ubiquitous, whereas external `memcpy' I'd expect to be
called from time to time only.
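
 A minimal sketch of the kind of code I have in mind follows.  The
structure and function names are hypothetical, and `__builtin_memcpy' is a
GCC/Clang extension.  Whether the copies end up using SSE registers
depends on the compiler version, target and flags; compiling this with
`gcc -O2 -S' and looking for XMM register moves would answer the question
for a given toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical small structure, 24 bytes with 8-byte doubles. */
struct point3 {
    double x, y, z;
};

/* Passed and returned by value: the compiler emits the copies itself. */
static struct point3 shift_x(struct point3 p, double dx)
{
    p.x += dx;
    return p;
}

/* Fixed-size copy: a candidate for inline expansion, no library call. */
static void copy_point(struct point3 *dst, const struct point3 *src)
{
    assert(dst != NULL && src != NULL);
    __builtin_memcpy(dst, src, sizeof(*dst));
}
```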

 Additionally I believe long-executing FPU instructions (i.e.
transcendentals) benefit from lazy switching: they can continue to execute
in parallel after the rest of the context has already been switched,
whereas an eager FPU context switch would stall until the FPU instruction
has completed.

 And last but not least, why does the handling of CR0.TS traps have to be
complicated?  It does not look like rocket science to me: it should be a
mere handful of instructions, and the time required to move the two FP
contexts out from and in to the FPU respectively should dominate
processing time.  The optimisation manual, where quoted, states 250 cycles
for FXSAVE and FXRSTOR combined.

 And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
bootstrap depending on processor features, so you don't have to do all the
run-time checks on every trap.  You can even optimise the FSAVE handler
away at build time if you know it won't ever be used based on the
minimal supported processor family selected.
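
 Roughly what I mean, as a user-space sketch rather than actual kernel
code: the save routine is chosen once through a function pointer, so the
trap path carries no feature check.  All names here are hypothetical,
including the `MODEL_MIN_CPU_HAS_FXSR' build-time switch, and the two save
routines are stand-ins that merely return a tag.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fpu_save_fn)(void *ctx);

#ifndef MODEL_MIN_CPU_HAS_FXSR
/* Stand-in for an FSAVE-based save path; compiled out when not needed. */
static int save_with_fsave(void *ctx)
{
    (void)ctx;
    return 1;
}
#endif

/* Stand-in for an FXSAVE-based save path. */
static int save_with_fxsave(void *ctx)
{
    (void)ctx;
    return 2;
}

static fpu_save_fn fpu_save;  /* set once at bootstrap */

/* Called once at bootstrap with the detected processor feature. */
static void fpu_select_handler(int has_fxsr)
{
#ifdef MODEL_MIN_CPU_HAS_FXSR
    (void)has_fxsr;
    fpu_save = save_with_fxsave;
#else
    fpu_save = has_fxsr ? save_with_fxsave : save_with_fsave;
#endif
}

/* The trap path: no feature check, just an indirect call. */
static int ts_trap(void *ctx)
{
    assert(fpu_save != NULL);
    return fpu_save(ctx);
}
```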

 Do you happen to know, or can you determine, how much time (in clock
cycles) a CR0.TS trap itself takes, including any time required to
preserve the execution state in the handler, such as pushing/popping GPRs
to/from the stack (as opposed to processing time spent on moving the FP
contexts back and forth)?  Is there no room for improvement left there?
How many task scheduling slots, say per million, must be poking at the FPU
for eager FPU context switching to have an advantage over the lazy one?
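
 For that last question, a back-of-the-envelope model can be sketched as
below.  It assumes the eager scheme pays the context move on every slot,
the lazy scheme pays the move plus the trap overhead only on slots that
touch the FPU, and everything else (e.g. the cost of writing CR0) is
ignored.  The function name is made up, and the trap overhead is exactly
the unknown I am asking about, so any value passed for it is hypothetical.

```c
#include <assert.h>

/*
 * Break-even point under the assumptions above: with n FPU-using slots
 * per million, the lazy cost n * (move + trap) equals the eager cost
 * 1000000 * move when n = 1000000 * move / (move + trap).
 * The result is rounded down.
 */
static unsigned long long
breakeven_slots_per_million(unsigned long long move_cycles,
                            unsigned long long trap_cycles)
{
    assert(move_cycles + trap_cycles > 0);
    return (1000000ULL * move_cycles) / (move_cycles + trap_cycles);
}
```

 With the 250 cycles quoted for FXSAVE and FXRSTOR combined and a purely
hypothetical trap overhead of 100 cycles, this gives 714285, i.e. under
this model the eager scheme would only win if more than about 71% of the
slots touched the FPU.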

Maciej