Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

From: Maciej W. Rozycki
Date: Mon Feb 23 2015 - 16:17:51 EST

On Sat, 21 Feb 2015, Andy Lutomirski wrote:

> > Additionally I believe long-executing FPU instructions (i.e.
> > transcendentals) can take advantage of continuing to execute in parallel
> > where the context has already been switched rather than stalling an eager
> > FPU context switch until the FPU instruction has completed.
> It seems highly unlikely to me that a slow FPU instruction can retire
> *after* a subsequent fxsave, which would need to happen for this to
> work.

I meant something else -- a slow FPU instruction can retire after a task
has been switched where the FP context has been left intact, i.e. in the
lazy FP context switching case, where only the MMU context and GPRs have
been replaced. Whereas in the eager FP context switching case you can get
through to FXSAVE while a slow FPU instruction hasn't completed yet (e.g.
started just as preemption was about to happen).

Obviously that FXSAVE will have to stall until the FPU instruction has
completed (IIRC the i486 aborted transcendental instructions on any
exceptions/interrupts instead, leading to the risk of process starvation
in heavily interrupt-loaded systems, but I also believe that has been
fixed as of the Pentium). Though if, as you say, merely taking a
trap/interrupt gate can take hundreds of cycles, perhaps indeed no FPU
instruction will execute *that* long on modern silicon.

> > And last but not least, why does the handling of CR0.TS traps have to be
> > complicated? It does not look like rocket science to me, it should be a
> > mere handful of instructions, the time required to move the two FP
> > contexts out from and in to the FPU respectively should dominate
> > processing time. Where quoted the optimisation manual states 250 cycles
> > for FXSAVE and FXRSTOR combined.
> The TS traps aren't complicated -- they're just really slow. I think
> that each of setting *and* clearing TS serializes and takes well over
> a hundred cycles. A #NM interrupt (the thing that happens if you try
> to use the FPU with TS set) serializes and does all kinds of slow
> things, so it takes many hundreds of cycles. The handler needs to
> clear TS (another hundred cycles or more), load the FPU state
> (actually rather fast on modern CPUs), and then IRET back to userspace
> (hundreds of cycles). This adds up to a lot of cycles. A round trip
> through an exception handler seems to be thousands of cycles.

That sucks wet goat farts! :(

I have to admit I have drifted away from the x86 world a bit and didn't
realise things had become so bad. Some 10 years ago or so taking a trap
or interrupt gate would need some 30 cycles (of course task gates are
unusable for anything that does not absolutely require them such as a #DF;
their usability for anything real ended with the 80286 or suchlike).
Similarly an IRET to reverse the actions taken. That was already rather
bad, but understandable, after all the CPU had to read the gate
descriptor, access the TSS, switch both CS and SS descriptors, etc.

What I don't understand is why CLTS, a dedicated instruction that avoids
the need to access whole CR0 (that again can understandably be costly,
because of the grouping of all the important bits there), has to be so
slow. It flips a single bit down there and does not need to serialise
anything, as any instruction down the pipeline it could affect would
trigger a #NM anyway! And there's an IRET somewhere on the way too,
before the instruction that originally triggered the fault will be

And why the heck over all these years a mechanism similar to SYSENTER and
its bunch of complementing MSRs hasn't been invented for the common
exceptions, to avoid all this gate descriptor dance!

> > And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
> > bootstrap depending on processor features, you don't have to do all the
> > run-time check on every trap. You can even optimise the FSAVE handler
> > away at the build time if you know it won't ever be used based on the
> > minimal supported processor family selected.
> None of this matters. A couple of branches in an #NM handler are
> completely lost in the noise.

Agreed, given what you state, completely understood.

> > Do you happen to know or can determine how much time (in clock cycles) a
> > CR0.TS trap itself takes, including any time required to preserve the
> > execution state in the handler such as pushing/popping GPRs to/from the
> > stack (as opposed to processing time spent on moving the FP contexts back
> > and forth)? Is there no room for improvement left there? How many task
> > scheduling slots say per million must be there poking at the FPU for eager
> > FPU context switching to take advantage over lazy one?
> Thousands of cycles. Considerably worse in an SVM guest. x86
> exception handling sucks.

I must have been spoilt with the MIPS exception handling. Taking an
exception on a MIPS processor is virtually instantaneous, just like
retiring another instruction. Of course there's a cost equivalent to a
branch misprediction, as you need to invalidate the whole pipeline. So
depending on how many stages you have there, you can expect a latency of
say 3-7 clocks.

Granted, on a MIPS processor taking an exception does not change much --
it switches into the kernel mode (1 bit set in a control register, a
special kernel-mode-override bit dedicated to exception handling), saves
the old PC (another control register updated; called Exception PC or EPC)
and loads the PC with the exception vector. All the rest is left to the
kernel. Which is good!

The same goes for ERET, the exception return instruction -- it merely
loads the PC back from EPC and clears the kernel-mode-override bit in the
other control register. More recently it also serves the purpose of an
instruction hazard barrier, which you'd call synchronisation, the
strongest kind provided in the MIPS architecture (in older architecture
revisions you had to take care of any hazards caused by preceding
instructions that could affect user-mode execution, by inserting the right
number of NOPs before ERET, possibly taking other instructions already
executed since the origin of the hazard into account). So rather than 3-7
clocks that could be 20 or so, though usually much fewer.

A while ago I cooperated with the hardware team in adding an extra
instruction to the architecture under the assumption that it will be
emulated on legacy hardware, by taking the RI or Reserved Instruction
exception (the equivalent to x86's #UD) and doing the rest there.
Another assumption was a fast path would be taken for this single
instruction and all the handling done in assembly, without even reaching
the usual C-language RI handlers that we've accumulated over the years.

Mind that exceptions actually have to be decoded and dispatched to
individual handlers on MIPS processors first, it's not that individual
exception classes have individual vectors like with x86 -- there's only
one! And you need to update EPC too or you'd be trapping back. Finally
the instruction itself had to be decoded, so instruction memory had to be
read and compared against the pattern expected.

To make a long story short I was able to squeeze all the handling into
some 30 cycles, with a slight variation across different processor
implementations. How much different!

Oh well, some further benchmarking is still needed, but given the
circumstances I suppose the good old design will have to go after all,
sigh... Thanks for your input!
