Re: x86 memcpy performance

From: Andrew Lutomirski
Date: Mon Aug 15 2011 - 13:04:55 EST

On Mon, Aug 15, 2011 at 12:12 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>>> But irq_fpu_usable() still checks !in_interrupt(), which means
>>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>>> still are fine when running with CR0.TS. So what happens when we get an
>>> #NM as a result of executing an FPU instruction in an IRQ handler? We
> will have to do init_fpu() on the current task if the latter hasn't used
>>> math yet and do the slab allocation of the FPU context area (I'm looking
>>> at math_state_restore, btw).
>> IIRC kernel_fpu_begin does clts, so #NM won't happen.  But if we're in
>> an interrupt and TS=1, then we know that we're not in a
>> kernel_fpu_begin section, so it's safe to start one (and do clts).
> Doh, yes, I see it now. This way we save the math state of the current
> process if needed and "disable" #NM exceptions until kernel_fpu_end() by
> clearing CR0.TS, sure. Thanks.
>> IMO this code is not very good, and I plan to fix it sooner or later.
> Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
> You could probably reuse some bits from there. The patchset should be in
> tip/x86/xsave.
>> I want kernel_fpu_begin (or its equivalent*) to be very fast and
>> usable from any context whatsoever.  Mucking with TS is slower than a
>> complete save and restore of YMM state.
> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
> This would obviate the need to muck with contexts but that could get
> expensive wrt stack operations. The advantage is that I'm not dealing
> with the whole FPU state but only with 16 XMM regs. I should probably
> dust off that version again and retest.

I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
80 ns and a full state save+restore is only ~60 ns. Without
infrastructure changes, I don't think you can avoid the clts and stts.

You might be able to get away with turning off IRQs, reading CR0 to
check TS, pushing XMM regs, and being very certain that you don't
accidentally generate any VEX-coded instructions.

> Or, if we want to use SSE stuff in the kernel, we might think of
> allocating its own FPU context(s) and handle those...

I'm thinking of having a stack of FPU states to parallel irq stacks
and IST stacks. It gets a little hairy when code inside
kernel_fpu_begin traps for a non-irq non-IST reason, though.
Fortunately, those are rare and all of the EX_TABLE users could mark
xmm regs as clobbered (except for copy_from_user...). Keeping
kernel_fpu_begin non-preemptable makes it less bad because the extra
FPU state can be per-cpu and not per-task.

This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
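
A rough user-space model of the "stack of FPU states" idea, as a sketch
only. Nothing here is existing kernel code: the model_* names, the depth
of 4, and the single global standing in for per-cpu data are all
assumptions made for illustration. Each nested section saves the live
registers into the next slot and restores them on the way out, which is
why the section has to stay non-preemptable for a per-cpu stack to work.

```c
#include <assert.h>

#define FPU_STATE_BYTES 512  /* size of an FXSAVE area; XSAVE areas are larger */
#define FPU_STACK_DEPTH 4    /* hypothetical: task, softirq, hardirq, NMI/IST */

struct fpu_state {
    unsigned char bytes[FPU_STATE_BYTES];
};

/* One of these per CPU in the imagined design; callers own the instance. */
struct fpu_state_stack {
    struct fpu_state slot[FPU_STACK_DEPTH];
    int depth;
};

/* Stand-in for the live register file that FXSAVE/XSAVE would read. */
static struct fpu_state live_regs;

/* Save the live state into the next free slot. Returns -1 if the stack is full. */
static int model_fpu_begin(struct fpu_state_stack *st)
{
    if (st->depth == FPU_STACK_DEPTH)
        return -1;
    st->slot[st->depth++] = live_regs;
    return 0;
}

/* Restore the state saved by the matching model_fpu_begin(). */
static void model_fpu_end(struct fpu_state_stack *st)
{
    assert(st->depth > 0);
    live_regs = st->slot[--st->depth];
}
```

The model deliberately ignores the hairy case mentioned above, a trap
inside the section for a non-irq, non-IST reason.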

The major speedup will come from saving state in kernel_fpu_begin but
not restoring it until the code in entry_??.S restores registers.
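
To make the deferred restore concrete, here is a small user-space model
that just counts saves and restores. It is a sketch under assumptions:
the lazy_* names are hypothetical, the counters stand in for the real
XSAVE/XRSTOR work, and nesting from interrupts is ignored. The point is
only that several kernel FPU sections between entry and exit cost one
save and one restore in total.

```c
#include <assert.h>
#include <stdbool.h>

struct lazy_fpu_model {
    bool user_state_saved; /* user registers already written to memory */
    int saves;             /* full state saves performed so far */
    int restores;          /* full state restores performed so far */
};

/* Save the user state only the first time a section is entered. */
static void lazy_fpu_begin(struct lazy_fpu_model *m)
{
    if (!m->user_state_saved) {
        m->saves++;                 /* stands in for XSAVE/FXSAVE */
        m->user_state_saved = true;
    }
}

/* Leaving a section does no work; the restore is deferred. */
static void lazy_fpu_end(struct lazy_fpu_model *m)
{
    assert(m->user_state_saved);    /* must follow a lazy_fpu_begin() */
}

/* Models the exit path that restores registers before returning to user mode. */
static void lazy_return_to_user(struct lazy_fpu_model *m)
{
    if (m->user_state_saved) {
        m->restores++;              /* stands in for XRSTOR/FXRSTOR */
        m->user_state_saved = false;
    }
}
```

With three begin/end pairs followed by one return to user mode, the
model ends with saves == 1 and restores == 1.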

>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
> Well, do we want to use floating point instructions in the kernel?

The only use I could find is in staging.
