Re: x86 memcpy performance

From: Borislav Petkov
Date: Thu Sep 08 2011 - 04:36:06 EST


On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.lankhorst@xxxxxxxxx> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
>
> "rep movs" is generally optimized in microcode on most modern Intel
> CPUs for some easyish cases, and it will outperform just about
> anything.
>
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
>
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
>
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
>
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".
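
(For reference, that "rep movs" flavor boils down to something like the
sketch below - purely illustrative, rep_movs_memcpy() is a made-up name
and this is not the actual arch/x86 implementation.)

#include <linux/types.h>

/*
 * Illustrative only: copy len bytes with the string-move instructions
 * the microcode fast path likes - not the real arch/x86 memcpy.
 */
static void *rep_movs_memcpy(void *dst, const void *src, size_t len)
{
	void *d = dst;
	const void *s = src;
	size_t qwords = len >> 3;
	size_t rest   = len & 7;

	/* bulk of the copy in 8-byte chunks ... */
	asm volatile("rep movsq"
		     : "+D" (d), "+S" (s), "+c" (qwords)
		     : : "memory");
	/* ... then the sub-8-byte tail */
	asm volatile("rep movsb"
		     : "+D" (d), "+S" (s), "+c" (rest)
		     : : "memory");
	return dst;
}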

Yeah,

this probably makes enabling SSE memcpy in the kernel a task with
diminishing returns. There is also the additional cost of
saving/restoring FPU context in the kernel, which eats into any SSE
speedup.
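
Roughly, any in-kernel SSE copy has to be bracketed by
kernel_fpu_begin()/kernel_fpu_end(), along these lines - only a sketch,
sse_memcpy() is a made-up name and the loop is deliberately dumb
(assumes len is a multiple of 16 and aligned buffers):

#include <linux/types.h>
#include <asm/i387.h>		/* kernel_fpu_begin()/kernel_fpu_end() */

static void sse_memcpy(void *dst, const void *src, size_t len)
{
	size_t i;

	/*
	 * Save the current FPU/SSE state before touching %xmm0 and
	 * restore it afterwards - that bracket is the fixed cost which
	 * eats into whatever the wider loads/stores gain.
	 */
	kernel_fpu_begin();

	for (i = 0; i < len; i += 16)
		asm volatile("movaps (%0), %%xmm0\n\t"
			     "movaps %%xmm0, (%1)"
			     : /* no outputs */
			     : "r" (src + i), "r" (dst + i)
			     : "memory", "xmm0");

	kernel_fpu_end();
}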

And then there's the additional I$ pressure, because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps have the
shortest (two-byte) opcodes I could use - in the AVX case the encoding
can grow to 4 bytes and more once you add the VEX prefix and the
additional SIB, displacement, etc. fields.
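
To make the size point concrete, compare the two inner loops below -
instruction sizes are approximate and the unrolling factor is
arbitrary, it's only meant to show where the extra I$ footprint comes
from:

	# the whole bulk copy, a single 3-byte instruction:
	rep movsq

	# vs. one unrolled 64-byte round of an XMM copy,
	# roughly 3-4 bytes per instruction (4+ with a VEX prefix):
1:	movaps 0x00(%rsi), %xmm0
	movaps 0x10(%rsi), %xmm1
	movaps 0x20(%rsi), %xmm2
	movaps 0x30(%rsi), %xmm3
	movaps %xmm0, 0x00(%rdi)
	movaps %xmm1, 0x10(%rdi)
	movaps %xmm2, 0x20(%rdi)
	movaps %xmm3, 0x30(%rdi)
	addq   $0x40, %rsi
	addq   $0x40, %rdi
	subq   $0x40, %rcx
	jnz    1b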

Oh, and then there's copy_*_user, which also does fault handling;
replacing that with an SSE version of memcpy could get quite hairy
quite fast.
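
What I mean by hairy: every access to user memory needs an exception
table entry plus a fixup, roughly like this single-word sketch -
copy_word_from_user() is a made-up helper, the real thing is
arch/x86/lib/copy_user_64.S and friends:

#include <linux/errno.h>
#include <linux/compiler.h>
#include <asm/asm.h>		/* _ASM_EXTABLE() */

/*
 * Illustrative only: fetch one 8-byte word from user space, returning
 * -EFAULT if the load faults. The real copy_*_user does this for the
 * whole buffer and reports how many bytes were left uncopied.
 */
static int copy_word_from_user(unsigned long *dst,
			       const unsigned long __user *src)
{
	unsigned long val;
	int ret = 0;

	asm volatile("1:	movq (%2), %1\n"
		     "2:\n"
		     ".section .fixup, \"ax\"\n"
		     "3:	movl %3, %0\n"
		     "	jmp 2b\n"
		     ".previous\n"
		     _ASM_EXTABLE(1b, 3b)
		     : "+r" (ret), "=r" (val)
		     : "r" (src), "i" (-EFAULT)
		     : "memory");

	if (!ret)
		*dst = val;
	return ret;
}

And with 16-byte XMM accesses the fixup side would also have to figure
out how many bytes actually made it before the fault, which is exactly
where it starts getting hairy.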

Anyway, I'll try to benchmark an asm version of SSE memcpy in the
kernel when I get the time, to see whether it makes sense at all.
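
Something along these lines, probably - a crude get_cycles() comparison
of the two sketches above; bench_one() is made up, and a real
measurement would need warm-up, multiple runs, different sizes and
alignments, hot and cold caches:

#include <linux/kernel.h>
#include <linux/timex.h>	/* get_cycles() */

static void bench_one(void *dst, const void *src, size_t len)
{
	cycles_t t0, t1, t2;

	t0 = get_cycles();
	rep_movs_memcpy(dst, src, len);		/* sketch from above */
	t1 = get_cycles();
	sse_memcpy(dst, src, len);		/* sketch from above */
	t2 = get_cycles();

	pr_info("len %zu: rep movs %llu cycles, sse %llu cycles\n",
		len,
		(unsigned long long)(t1 - t0),
		(unsigned long long)(t2 - t1));
}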

Thanks.

--
Regards/Gruss,
Boris.
--