Re: [PATCH] zerocopy NFS for 2.5.36

From: Andrew Morton
Date: Wed Sep 18 2002 - 19:16:43 EST

Alan Cox wrote:
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86. I do not remember all
> The better was a freak of PPro/PII scheduling I think
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
> The same as bit is easy to explain. It's totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl. copy_duff just uses a stream of "movl"s
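
For anyone who wants to see the two styles side by side, here is a rough
sketch. It is not the cptimer code: the function names are made up, the
"movl" version is a plain 4x unrolled loop rather than whatever __copy_duff
actually does, and the movsl version falls back to memcpy() when not built
with a GCC-compatible compiler on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a movsl-style copy: a single string
 * instruction moves nwords 32-bit words. Off x86 it is just memcpy(). */
void copy_movsl_sketch(void *dst, const void *src, size_t nwords)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile("rep; movsl"
                     : "+D"(dst), "+S"(src), "+c"(nwords)
                     :
                     : "memory");
#else
    memcpy(dst, src, nwords * 4);
#endif
}

/* 4-byte load/store that are safe at any alignment; on x86 compilers
 * normally turn each of these into a single movl. */
static inline uint32_t load32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static inline void store32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Hypothetical stand-in for a "stream of movl" copy: explicit 32-bit
 * loads and stores, unrolled four words per iteration. */
void copy_movl_sketch(void *dst, const void *src, size_t nwords)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (nwords >= 4) {
        uint32_t a = load32(s);
        uint32_t b = load32(s + 4);
        uint32_t c = load32(s + 8);
        uint32_t e = load32(s + 12);

        store32(d, a);
        store32(d + 4, b);
        store32(d + 8, c);
        store32(d + 12, e);
        s += 16;
        d += 16;
        nwords -= 4;
    }
    /* Copy the remaining 0..3 words one at a time. */
    while (nwords--) {
        store32(d, load32(s));
        s += 4;
        d += 4;
    }
}
```

Both take a count of 32-bit words, so copy_movl_sketch(dst, src + 1, 2560)
would be the rough equivalent of the nbytes=10240, from_align=1 case.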

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
    copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins. But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
    copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins. By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
    copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.
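
The scriptlets are not reproduced here. As a rough idea of what sweeping
the matrix involves, a sketch along these lines would do it. Assumptions:
the function names are invented, plain memcpy() stands in for the routine
under test, and the buffers will be warm in cache, so the numbers will not
be comparable to the uncached runs above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time `reps` copies of `nbytes` bytes with the given
 * source and destination offsets (0..7) from a malloc()ed base.
 * Returns 0 and stores Mbytes/sec in *rate, or -1 on failure. */
int time_copy_sketch(size_t from_align, size_t to_align,
                     size_t nbytes, int reps, double *rate)
{
    /* Over-allocate so both offsets stay inside the buffers. */
    unsigned char *src = malloc(nbytes + 8);
    unsigned char *dst = malloc(nbytes + 8);
    int ok = -1;

    if (src && dst && from_align < 8 && to_align < 8) {
        memset(src, 0xA5, nbytes + 8);
        memset(dst, 0, nbytes + 8);

        clock_t t0 = clock();
        for (int i = 0; i < reps; i++)
            memcpy(dst + to_align, src + from_align, nbytes);
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        /* Report 0 if the run was too short for clock() to resolve. */
        *rate = secs > 0.0
              ? (double)nbytes * reps / (1024.0 * 1024.0) / secs
              : 0.0;
        /* Verify the copy so the loop cannot be optimised away. */
        if (memcmp(dst + to_align, src + from_align, nbytes) == 0)
            ok = 0;
    }
    free(src);
    free(dst);
    return ok;
}

/* Print one line per (from_align, to_align) pair. */
void print_matrix_sketch(size_t nbytes, int reps)
{
    for (size_t f = 0; f < 8; f++) {
        for (size_t t = 0; t < 8; t++) {
            double rate;

            if (time_copy_sketch(f, t, nbytes, reps, &rate) == 0)
                printf("from_align=%zu to_align=%zu: %.1f Mbytes/sec\n",
                       f, t, rate);
        }
    }
}
```

Calling print_matrix_sketch(10240, 2000) walks all 64 combinations at the
same nbytes as the runs above.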

I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.
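
The alignment rule itself is cheap to test. A minimal sketch, with made-up
names and not taken from any kernel patch, of choosing the copy flavour
from the two pointers:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the two copy flavours discussed above. */
enum copy_kind {
    COPY_MOVSL,   /* string-instruction copy */
    COPY_MOVL     /* open-coded 32-bit loads and stores */
};

/* Pick movsl only when both src and dst are 8-byte-aligned.
 * OR-ing the addresses lets one mask check both at once: any low
 * bit set in either pointer survives the OR. */
enum copy_kind choose_copy_sketch(const void *dst, const void *src)
{
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;

    return (bits & 7u) == 0 ? COPY_MOVSL : COPY_MOVL;
}
```

With an 8-byte-aligned buffer buf, choose_copy_sketch(buf, buf + 8) picks
movsl, while any odd offset on either side, or the 8+4 case, picks movl.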

It needs redoing. These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at