Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

From: Ingo Molnar
Date: Sun Mar 08 2015 - 07:38:58 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> Doing that would give us four (theoretical) performance advantages:
>
> - No implicit irq disabling overhead when the syscall instruction is
> executed: we could change MSR_SYSCALL_MASK from 0xc0000084 to
> 0xc0000284, which removes the implicit CLI on syscall entry.
>
> - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
> system_call.
>
> - No explicit irq disabling overhead in the ret_from_sys_call fast
> path, i.e. no DISABLE_INTERRUPTS() [CLI].
>
> - No implicit irq enabling overhead in ret_from_sys_call's
> USERGS_SYSRET64: the SYSRETQ instruction would not have to
> re-enable irqs as the user-space IF in R11 would match that of the
> current IF.
>
> whether that's an actual performance win in practice as well needs
> to be measured, but I'd be (very!) shocked if it wasn't in the 20+
> cycles range: which is absolutely huge in terms of system_call
> optimizations.
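
To make the four points above concrete, here is a toy C model of how the
interrupt flag moves through SYSCALL/SYSRET. It is only a sketch:
cpu_model, model_syscall(), model_sysret() and irqs_on() are made-up
names, the mask is treated purely as a set of flag bits, and SYSRET's
reserved-bit handling is left out.

```c
#include <stdint.h>

#define FLAG_IF 0x200ULL	/* RFLAGS.IF is bit 9 */

/* Hypothetical, minimal view of the state SYSCALL/SYSRET touch. */
struct cpu_model {
	uint64_t rflags;
	uint64_t r11;
};

/* SYSCALL: save RFLAGS in R11, then clear every flag set in the mask. */
static void model_syscall(struct cpu_model *c, uint64_t fmask)
{
	c->r11 = c->rflags;
	c->rflags &= ~fmask;
}

/* SYSRET: reload RFLAGS from R11 (reserved-bit handling omitted). */
static void model_sysret(struct cpu_model *c)
{
	c->rflags = c->r11;
}

static int irqs_on(const struct cpu_model *c)
{
	return (c->rflags & FLAG_IF) != 0;
}
```

With FLAG_IF in the mask, model_syscall() leaves irqs off and
model_sysret() has to turn them back on. With FLAG_IF left out of the
mask, IF never changes across the pair, which is the case the last
point in the list relies on.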

So just to quantify the potential 64-bit system call entry fast path
performance savings a bit, I tried to simulate the effects in
user-space via a 'best case' simulation, where we do a PUSHFQ+CLI+STI
... CLI+POPFQ simulated syscall sequence (beginning and end
sufficiently far from each other to not be interacting), on Intel
family 6 model 62 CPUs (slightly dated but still relevant):

with irq disabling/enabling:

new best speed: 2710739 loops (158 cycles per iteration).

fully preemptible:

new best speed: 3389503 loops (113 cycles per iteration).

Now that's a difference of roughly 45 cycles, but admittedly the cost
depends heavily on how we save and restore flags, and on how well the
CPU can hide the irq disabling and restoring among the other work it
has to do on entry/exit, which it can do pretty well in a number of
important cases.
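
For reference, a user-space loop of this kind could look roughly like
the sketch below. This is not the code behind the numbers above; it is a
reconstruction under assumptions: sim_syscall() and
best_cycles_per_iter() are made-up names, the filler loop stands in for
"sufficiently far from each other", and the "fully preemptible" variant
is assumed to keep the flags save/restore and drop only CLI/STI. The
with_irq_toggle path needs a prior successful iopl(3) on a kernel that
still lets user-space execute CLI/STI, otherwise it faults.

```c
#include <stdint.h>
#include <x86intrin.h>	/* __rdtsc(), GCC/Clang on x86-64 */

/* One simulated syscall entry/exit pair (hypothetical helper). */
static inline void sim_syscall(int with_irq_toggle)
{
	unsigned long flags;
	volatile unsigned int filler = 0;

	/* "Entry": save flags. Step over the 128-byte red zone first. */
	__asm__ volatile("lea -128(%%rsp), %%rsp\n\t"
			 "pushfq\n\t"
			 "pop %0\n\t"
			 "lea 128(%%rsp), %%rsp"
			 : "=r"(flags) : : "memory");
	if (with_irq_toggle)	/* faults without iopl(3) */
		__asm__ volatile("cli\n\tsti" : : : "memory");

	/* Assumed filler work keeping entry and exit apart. */
	for (int i = 0; i < 64; i++)
		filler++;

	if (with_irq_toggle)
		__asm__ volatile("cli" : : : "memory");
	/* "Exit": restore flags, POPFQ re-enables irqs if they were on. */
	__asm__ volatile("lea -128(%%rsp), %%rsp\n\t"
			 "push %0\n\t"
			 "popfq\n\t"
			 "lea 128(%%rsp), %%rsp"
			 : : "r"(flags) : "memory", "cc");
}

/* Best (lowest) cycles per iteration over 'runs' runs; iters must be > 0. */
static uint64_t best_cycles_per_iter(int with_irq_toggle,
				     unsigned int iters, unsigned int runs)
{
	uint64_t best = UINT64_MAX;

	for (unsigned int r = 0; r < runs; r++) {
		uint64_t t0 = __rdtsc();

		for (unsigned int i = 0; i < iters; i++)
			sim_syscall(with_irq_toggle);

		uint64_t per_iter = (__rdtsc() - t0) / iters;

		if (per_iter < best)
			best = per_iter;
	}
	return best;
}
```

Calling best_cycles_per_iter(0, ...) runs without any privileges;
best_cycles_per_iter(1, ...) is the irq disabling/enabling variant. The
red-zone adjustment and the filler loop add their own cost, so absolute
numbers from this sketch would not be comparable to the ones quoted
above, only the difference between the two variants would be.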

I don't think I can simulate the real thing in user-space:

 - The hardest bit to simulate is SYSRET: POPFQ is expensive, but
   SYSRET might be able to 'cheat' on the enabling side.

 - I _think_ it cannot cheat because user-space might have come in
   with irqs disabled itself (we still have iopl(3)), so it's a POPFQ
   equivalent instruction.

 - OTOH the CPU might be able to hide the latency of the POPFQ
   amongst other SYSRET return work (which is significant) - so this
   is really hard to estimate.

So "we'll have to try it to see it" :-/ [and maybe Intel knows.]

But even if just half of the suspected savings can be realized: a
20-cycle speedup is very tempting IMHO, given that our 64-bit system
calls cost around 110 cycles these days.

Yes, it's scary, crazy, potentially fragile, might not even work, etc.
- but it's also very tempting nevertheless ...

So I'll try to write a prototype of this, just to be able to get some
numbers - but shoot me down if you think I'm being stupid and if the
concept is an absolute non-starter to begin with!

Thanks,

Ingo