[PATCH] Fast csum_partial_copy_generic and more

From: kumon@flab.fujitsu.co.jp
Date: Fri May 19 2000 - 08:20:27 EST


Hi,

Here is a patch that speeds up csum_partial_copy_generic() on i686 SMP
by up to 50%. I have also added some analysis of further optimization
from the viewpoint of SMP cache behavior.

[Patch Summary]
The attached patch optimizes csum_partial_copy_generic().

Measured under the WEB-BENCH workload, the time consumed in the function
drops by about one third, i.e. it becomes roughly 1.5 times faster than
the original.

[Background of csum_partial_copy_generic]
 In the i686 version of the function, the long-word transfer is
unrolled 16 times, so one loop iteration copies exactly 64 bytes, apart
from the handling of the leading and trailing fractions.
 Observation of csum_partial_copy_generic shows that it produces a lot
of cache misses, and this is the main reason for its slowness.
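
For readers who prefer C to the assembly, here is a minimal sketch of
what that unrolled loop does. The function name and the 64-bit
accumulator (standing in for the adcl carry chain) are mine; this is an
illustration, not the kernel code.

#include <stdint.h>
#include <stddef.h>

/* Checksum-while-copy, 64 bytes per outer iteration.  The carry folding
 * at the end plays the role of the adcl $0,%eax in the real assembly. */
static uint32_t csum_copy_chunks(const uint32_t *src, uint32_t *dst,
                                 size_t nchunks, uint32_t sum)
{
    uint64_t acc = sum;

    for (size_t c = 0; c < nchunks; c++) {
        for (int i = 0; i < 16; i++) {   /* 16 longwords = 64 bytes    */
            acc += src[i];               /* addl/adcl: accumulate      */
            dst[i] = src[i];             /* movl: copy to destination  */
        }
        src += 16;
        dst += 16;
    }
    while (acc >> 32)                    /* fold carries back in       */
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}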

[How]
 To accelerate the function, a dummy read is used as a prefetch. Of
course, prefetching must stay within the accessible area. The top word
of a cache block should be the one prefetched, because on a cache miss
the CPU fetches the requested word first, and the first word of a block
is always needed earlier than the other words in it.

To keep track of the block-top word, a new pointer is added which
always points to the first word of the cache block containing the 63rd
byte of the current 64-byte source chunk. This pointer never points
outside the source region, so it is safe to read through it. The patch
does not make sense unless the CPU has superscalar execution that can
overlap the dummy read with the copy.

Strictly speaking, this prefetch may read up to 3 bytes just past the
end of the source region. But it never causes trouble, because this
excess area and the last transferred byte reside in the same cache
block.
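
The same idea expressed in C, as a hedged sketch (the pointer name pf,
the 32-byte line size, and the chunk-count interface are my assumptions
for illustration; this is not the patch itself):

#include <stdint.h>
#include <stddef.h>

/* Before checksumming/copying each 64-byte chunk, touch the two 32-byte
 * cache lines it occupies through a pointer kept aligned to the line
 * holding byte 63 of the chunk, so it never points outside the source. */
static uint32_t csum_copy_prefetch(const uint32_t *src, uint32_t *dst,
                                   size_t nchunks, uint32_t sum)
{
    uint64_t acc = sum;
    const volatile uint8_t *pf =
        (const uint8_t *)(((uintptr_t)src + 63) & ~(uintptr_t)31);

    for (size_t c = 0; c < nchunks; c++) {
        (void)pf[-32];                   /* dummy reads: pull both lines */
        (void)pf[0];                     /* of this chunk into the cache */
        for (int i = 0; i < 16; i++) {   /* 64-byte checksum-and-copy    */
            acc += src[i];
            dst[i] = src[i];
        }
        src += 16;
        dst += 16;
        pf  += 64;                       /* advance to the next chunk    */
    }
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}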

[Benchmarking Result]

We used Web-bench as the workload and measured three versions of
csum_partial_copy_generic.

Version synopsis:
2.3.99-pre8-base: the original kernel, but with SLAB_POISONING disabled
        so that performance can be compared with older kernels (the
        older data is not attached).
2.3.99-pre8-AS: base plus the patch from Artur Skawina
        <skawina@geocities.com>. This patch is also attached at the end
        of this mail.
2.3.99-pre8-Pf: base plus my prefetch patch.
        This patch is also included in this mail.

We obtained the following profile.

Each number is the average time consumed (in us) per web-transaction.

The machine is a 4-way SMP Xeon 450MHz, 2MB cache, with 2GB of memory,
built without HIGHMEM, so only 1GB is actually recognized.

   Pf      AS    base    (columns: 2.3.99-pre8-Pf / -AS / -base)
990.3  1023.8  1019.3   TOTAL(OS)
 64.9    58.2    54.4   default_idle
 24.8    21.8    20.8   cpu_idle

 63.4    94.8    98.3   csum_partial_copy_generic
 84.2    82.8    82.2   stext_lock
 76.0    76.8    77.7   boomerang_interrupt
 36.6    36.5    36.9   boomerang_rx
 33.4    33.6    34.3   boomerang_start_xmit
 28.6    29.0    29.4   schedule
 22.4    24.2    24.1   kmalloc
 21.5    23.5    22.0   kfree
 19.5    20.2    19.9   tcp_v4_rcv
 18.0    18.4    18.3   wait_for_completion
 17.9    18.3    17.9   __kfree_skb
 16.7    16.5    15.7   __wake_up
 11.0    11.3    10.8   do_IRQ
                        (rest dropped)

With the Pf patch, csum_partial_copy_generic drops from 98.3us to
63.4us, i.e. roughly a 1.5x speedup (98.3 / 63.4 ~= 1.55).

Unfortunately, the AS version does not show a significant gain. It may
show some advantage when the cache is hit, but in the current execution
environment that patch cannot hide the cache-miss latency.

In a user-land benchmark, the new patch also reduces the time even when
the source operand is aligned to a cache boundary.

With the patch applied, stext_lock reappears near the top of the
time-consumption ranking. It had previously lost that position when the
kernel-lock was avoided in poll().

**
 The following shows the breakdown of stext_lock in the prefetched
version. do_close() is currently the worst spin-lock waiter; it also
holds the lock for a relatively long time, which tightens the
bottleneck further.

          us  from where                 lock variable
        63.4  TOTAL
        --------------------------------
        12.1  do_close+144               0xc025b080 <- kernel_lock
        8.71  sys_fcntl+142              0xc025b080
        7.65  schedule+1684              0xc025b080
        6.27  sys_open+72                0xc025b080
        6.07  _fput+27                   0xc025b080
        5.97  old_mmap+301               0xc025b080
        5.53  sys_newstat+30             0xc025b080
        3.42  tcp_v4_rcv+610             0x2c(%ebx)
        2.95  boomerang_start_xmit+239   0x188(%ebx)
        1.09  tcp_accept+38              0x2c(%esi)
        0.65  wait_for_connect+621       0x2c(%ebp)

At this point there are several different approaches to reducing the
kernel overhead:

1. Shorten the kernel-lock region in do_close() and others.
2. Reduce cache misses in csum_partial_copy_generic().
3. Use a different kind of NIC, because boomerang_interrupt() does too
   many in/out operations, which add heavy overhead.
   When we measured eepro100, speedo_interrupt and its related functions
   needed only about half the execution time of boomerang_interrupt and
   friends.

I think the second point, miss reduction, is very important.
A user program (apache) provides the network data first, so if
csum_partial_copy_generic then runs on the same CPU, the data should
still be in the cache. However, another measurement shows massive
second-level cache misses on data loads in csum_partial_copy_generic.
Why? The time reduction from prefetching also supports the occurrence
of these misses.

My assumption is that the data provider (the caller of write() or
send()) and the data consumer (csum_partial_copy_generic()) are not
running on the same CPU. If that is true, we could reduce the overhead
by forcing them onto the same CPU, but I have no idea how.

The rest of this mail is the patches.

------------------
Prefetch version. (Pf)
------------------

diff -rc linux-2.3.99-pre8/arch/i386/lib/checksum.S linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S
*** linux-2.3.99-pre8/arch/i386/lib/checksum.S Thu Mar 23 07:23:54 2000
--- linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S Mon May 15 22:45:41 2000
***************
*** 394,419 ****
          movl ARGBASE+8(%esp),%edi #dst
          movl ARGBASE+12(%esp),%ecx #len
          movl ARGBASE+16(%esp),%eax #sum
! movl %ecx, %edx
          movl %ecx, %ebx
          shrl $6, %ecx
          andl $0x3c, %ebx
          negl %ebx
          subl %ebx, %esi
          subl %ebx, %edi
          lea 3f(%ebx,%ebx), %ebx
          testl %esi, %esi
          jmp *%ebx
  1: addl $64,%esi
          addl $64,%edi
          ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
          ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
          ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
          ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
  3: adcl $0,%eax
          dec %ecx
          jge 1b
! 4: andl $3, %edx
          jz 7f
          cmpl $2, %edx
          jb 5f
--- 394,425 ----
          movl ARGBASE+8(%esp),%edi #dst
          movl ARGBASE+12(%esp),%ecx #len
          movl ARGBASE+16(%esp),%eax #sum
! # movl %ecx, %edx
          movl %ecx, %ebx
+ movl %esi, %edx
          shrl $6, %ecx
          andl $0x3c, %ebx
          negl %ebx
          subl %ebx, %esi
          subl %ebx, %edi
+ lea -1(%esi),%edx
+ andl $-32,%edx
          lea 3f(%ebx,%ebx), %ebx
          testl %esi, %esi
          jmp *%ebx
  1: addl $64,%esi
          addl $64,%edi
+ SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
          ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
          ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
          ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
          ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
  3: adcl $0,%eax
+ addl $64, %edx
          dec %ecx
          jge 1b
! 4: movl ARGBASE+12(%esp),%edx #len
! andl $3, %edx
          jz 7f
          cmpl $2, %edx
          jb 5f

------------------
Artur Skawina patch. (AS)
------------------
diff -urNp /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S
--- /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S Wed Mar 29 20:53:25 2000
+++ linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S Sat Apr 22 10:43:28 2000
@@ -374,81 +373,119 @@ DST( movb %cl, (%edi) )
 
 /* Version for PentiumII/PPro */
 
+/*
+ This is
+ o 70% slower when the source is not 32 bit aligned [ie (long)src&3]
+ o 190% slower when the destination is not 32 bit aligned
+ o 260% slower when both source and destination are not 32 bit aligned
+ o 175% slower when destination is not 64 bit aligned and source _is_ [ie (long)dst&4]
+ o whether source is 64 bit aligned or not does not seem to make much difference
+ */
+
 #define ROUND1(x) \
- SRC(movl x(%esi), %ebx ) ; \
- addl %ebx, %eax ; \
- DST(movl %ebx, x(%edi) ) ;
+ SRC(movl x(%esi), %edx ) ;\
+ addl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ DST(movl %edx, x(%edi) ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
 
 #define ROUND(x) \
- SRC(movl x(%esi), %ebx ) ; \
- adcl %ebx, %eax ; \
- DST(movl %ebx, x(%edi) ) ;
+ SRC(movl x(%esi), %edx ) ;\
+ adcl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ DST(movl %edx, x(%edi) ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
+
+#define ROUNDL(x) \
+ SRC(movl x(%esi), %edx ) ;\
+ adcl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %edx, x(%edi) ) ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
 
 #define ARGBASE 12
                 
 csum_partial_copy_generic:
         pushl %ebx
- pushl %edi
+ movl ARGBASE+12-4*2(%esp),%ebx #len
         pushl %esi
- movl ARGBASE+4(%esp),%esi #src
- movl ARGBASE+8(%esp),%edi #dst
- movl ARGBASE+12(%esp),%ecx #len
- movl ARGBASE+16(%esp),%eax #sum
- movl %ecx, %edx
- movl %ecx, %ebx
- shrl $6, %ecx
- andl $0x3c, %ebx
+ movl ARGBASE+4-4*1(%esp),%esi #src
+ movl %ebx, %ecx
+ pushl %edi
+ movl ARGBASE+8-4*0(%esp),%edi #dst
+ andl $0x38, %ebx
+ addl %ebx, %esi
+ shrl $6, %ecx # len /= 64 (number of longwords per iteration)
+ addl %ebx, %edi
         negl %ebx
- subl %ebx, %esi
- subl %ebx, %edi
+ movl ARGBASE+16-4*0(%esp),%eax #sum
         lea 3f(%ebx,%ebx), %ebx
- testl %esi, %esi
+ testl %eax,%eax # CF=0
         jmp *%ebx
-1: addl $64,%esi
+1:
+ addl $64,%esi
         addl $64,%edi
- ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
- ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
- ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
- ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
-3: adcl $0,%eax
+ ROUND1(-64) ROUND (-56)
+ ROUND (-48) ROUND (-40)
+ ROUND (-32) ROUND (-24)
+ ROUND (-16) ROUNDL(-8)
+3:
+ adcl $0,%eax
         dec %ecx
         jge 1b
-4: andl $3, %edx
+
+ movl ARGBASE+12(%esp),%edx #len
+
+ testl $4,%edx
+ jz 4f
+ SRC(movl (%esi), %ebx )
+ addl %ebx, %eax
+ DST(movl %ebx, (%edi) )
+ leal 4(%esi), %esi
+ leal 4(%edi), %edi
+ adcl $0, %eax
+4:
+ andl $3, %edx
         jz 7f
         cmpl $2, %edx
         jb 5f
 SRC( movw (%esi), %dx )
- leal 2(%esi), %esi
 DST( movw %dx, (%edi) )
- leal 2(%edi), %edi
         je 6f
+ leal 2(%esi), %esi
         shll $16,%edx
+ leal 2(%edi), %edi
 5:
 SRC( movb (%esi), %dl )
 DST( movb %dl, (%edi) )
-6: addl %edx, %eax
+6:
+ addl %edx, %eax
         adcl $0, %eax
 7:
 .section .fixup, "ax"
-6001: movl ARGBASE+20(%esp), %ebx # src_err_ptr
- movl $-EFAULT, (%ebx)
+6001:
         # zero the complete destination (computing the rest is too much work)
         movl ARGBASE+8(%esp),%edi # dst
         movl ARGBASE+12(%esp),%ecx # len
+ movl ARGBASE+20(%esp), %ebx # src_err_ptr
         xorl %eax,%eax
+ movl $-EFAULT, (%ebx)
         rep; stosb
- jmp 7b
+ jmp 7b
 6002: movl ARGBASE+24(%esp), %ebx # dst_err_ptr
         movl $-EFAULT, (%ebx)
         jmp 7b
 .previous
 
- popl %esi
         popl %edi
+ popl %esi
         popl %ebx
         ret
                                 
 #undef ROUND
 #undef ROUND1
-
+
 #endif

--
Computer Systems Laboratory, Fujitsu Labs.
kumon@flab.fujitsu.co.jp
