Re: x86 memcpy performance

From: Borislav Petkov
Date: Mon Aug 15 2011 - 14:50:08 EST

In reply to: Andrew Lutomirski: "Re: x86 memcpy performance"
Next in thread: Andrew Lutomirski: "Re: x86 memcpy performance"

On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>> Well, I had an SSE memcpy which saved/restored the XMM regs on the stack.
>> This would obviate the need to muck with contexts but that could get
>> expensive wrt stack operations. The advantage is that I'm not dealing
>> with the whole FPU state but only with 16 XMM regs. I should probably
>> dust off that version again and retest.
>
> I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
> 80 ns and a full state save+restore is only ~60 ns.
> Without infrastructure changes, I don't think you can avoid the clts
> and stts.

Yeah, probably.

> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.

That's ok - I'm using movaps/movups. But the problem is that I still
need to save the FPU state if the task I'm interrupting has been using
FPU instructions. So I can't get away without saving the context, in
which case I don't need to save the XMM regs separately anyway.
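
For illustration only, here is a minimal userspace sketch of the kind of
copy loop being discussed. It is not the patch from this thread:
sse_copy_sketch is a hypothetical name, it uses compiler intrinsics
rather than hand-written movaps/movups, and it ignores everything that
makes the kernel case hard (clts/stts, saving the interrupted task's
FPU state). Without SSE support at compile time it simply falls back to
memcpy.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical userspace helper, not the kernel's memcpy. */
static void *sse_copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE__)
    /* Move 16 bytes per iteration through one XMM register.
     * The unaligned load/store intrinsics typically compile to movups. */
    while (n >= 16) {
        __m128 v = _mm_loadu_ps((const float *)(const void *)s);
        _mm_storeu_ps((float *)(void *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Copy the tail (or the whole buffer when SSE is unavailable). */
    memcpy(d, s, n);
    return dst;
}
```

In the kernel, a loop like this would have to sit between
kernel_fpu_begin() and kernel_fpu_end(), which is exactly where the
save/restore cost discussed above comes from.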

>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.

... I'm guessing with the same nesting as hardirqs? That would make FPU
instructions usable in irq contexts too.
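
To make the "stack of FPU states" idea concrete, here is a small
userspace model. Everything in it is an assumption made for
illustration: the names (fpu_model, fpu_state_stack, model_fpu_begin,
model_fpu_end), the nesting depth of 4, and the choice to model only
the 16 XMM registers as a byte array. It shows the push/pop discipline,
not how the kernel would actually save registers.

```c
#include <assert.h>
#include <string.h>

#define XMM_BYTES   (16 * 16)  /* 16 XMM registers, 16 bytes each */
#define MAX_NESTING 4          /* assumed depth, e.g. task, softirq, hardirq, NMI */

/* Stand-in for the live register file. */
struct fpu_model {
    unsigned char xmm[XMM_BYTES];
};

/* One save slot per nesting level; would be per-cpu in a real design. */
struct fpu_state_stack {
    struct fpu_model saved[MAX_NESTING];
    int depth;
};

/* Push the interrupted context's registers before this level uses them. */
static void model_fpu_begin(struct fpu_state_stack *st,
                            const struct fpu_model *live)
{
    assert(st->depth < MAX_NESTING);
    st->saved[st->depth++] = *live;
}

/* Pop: hand the registers back to the context that was interrupted. */
static void model_fpu_end(struct fpu_state_stack *st, struct fpu_model *live)
{
    assert(st->depth > 0);
    *live = st->saved[--st->depth];
}
```

Each nesting level sees its own register contents restored when the
level above it finishes, which is what would make FPU use in irq
context safe in such a scheme.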

> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.

How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?

> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).

Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows reasonable speedup there, we might need to make those work too.

> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.

Yep.

> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.

But you'd need to save each kernel FPU state when nesting, no?
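
A toy model of the deferred restore, as I read the proposal, is below.
The names (lazy_fpu_model, lazy_fpu_begin, lazy_fpu_end,
lazy_return_to_user) are made up for this sketch and it only counts
operations. It deliberately ignores the nesting question raised above:
a nested kernel user would still need its own save area.

```c
#include <stdbool.h>

struct lazy_fpu_model {
    bool user_state_saved;  /* user FPU state is parked in memory */
    int saves;              /* number of full state saves performed */
    int restores;           /* number of full state restores performed */
};

/* Entering a kernel FPU section: save user state only the first time. */
static void lazy_fpu_begin(struct lazy_fpu_model *m)
{
    if (!m->user_state_saved) {
        m->saves++;
        m->user_state_saved = true;
    }
}

/* Leaving the section: deliberately no restore here. */
static void lazy_fpu_end(struct lazy_fpu_model *m)
{
    (void)m;
}

/* Stand-in for the return-to-user path in entry_??.S. */
static void lazy_return_to_user(struct lazy_fpu_model *m)
{
    if (m->user_state_saved) {
        m->restores++;
        m->user_state_saved = false;
    }
}
```

In this model, any number of back-to-back begin/end sections before
returning to user space costs one save and one restore in total.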

>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.

Exactly my point - I think we should do it only when it's really worth
the trouble.
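
On the MXCSR point quoted above, a userspace sketch of what "not
trusting MXCSR" could look like: load a known control value before
doing any floating point, then put the old value back. The helper name
with_default_mxcsr is hypothetical, and this is not what
kernel_fpu_begin does. 0x1F80 is the architectural reset value of MXCSR
(all exceptions masked, round to nearest). The 387 control word is not
handled here.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

#define MXCSR_DEFAULT 0x1F80u  /* reset value: exceptions masked, round-to-nearest */

/* Hypothetical helper: run fn with a known MXCSR, then restore the old one.
 * Returns the control bits that were in effect while fn ran. */
static unsigned int with_default_mxcsr(void (*fn)(void))
{
#if defined(__SSE__)
    unsigned int old = _mm_getcsr();   /* whatever the previous owner left */
    unsigned int seen;

    _mm_setcsr(MXCSR_DEFAULT);
    if (fn != NULL)
        fn();
    seen = _mm_getcsr() & ~0x3Fu;      /* drop the sticky exception flags */
    _mm_setcsr(old);
    return seen;
#else
    /* No SSE at compile time: nothing to sanitize in this sketch. */
    if (fn != NULL)
        fn();
    return MXCSR_DEFAULT;
#endif
}
```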

--
Regards/Gruss,
Boris.
