I also improved a bit the asm (mainly the csum_copy that wasn't
unrolling enough according to the new numbers).
sz[0] means that the csum is been run on a buffer with size == 0 byte.
2.3.3 csum_partial:
measure_csum_partial
csum_partial: sz[0] 10 iterations takes 291 microseconds
csum_partial: sz[0] 1 iteration takes <29 microseconds>==<5 nanoseconds>
csum_partial: sz[1] 10 iterations takes 344 microseconds
csum_partial: sz[1] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[2] 10 iterations takes 344 microseconds
csum_partial: sz[2] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[3] 10 iterations takes 344 microseconds
csum_partial: sz[3] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[4] 10 iterations takes 311 microseconds
csum_partial: sz[4] 1 iteration takes <31 microseconds>==<6 nanoseconds>
csum_partial: sz[5] 10 iterations takes 363 microseconds
csum_partial: sz[5] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[6] 10 iterations takes 363 microseconds
csum_partial: sz[6] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[7] 10 iterations takes 362 microseconds
csum_partial: sz[7] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[8] 10 iterations takes 327 microseconds
csum_partial: sz[8] 1 iteration takes <32 microseconds>==<6 nanoseconds>
csum_partial: sz[9] 10 iterations takes 382 microseconds
csum_partial: sz[9] 1 iteration takes <38 microseconds>==<7 nanoseconds>
csum_partial: sz[1024] 10 iterations takes 6427 microseconds
csum_partial: sz[1024] 1 iteration takes <642 microseconds>==<125 nanoseconds>
csum_partial: sz[1025] 10 iterations takes 6465 microseconds
csum_partial: sz[1025] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1026] 10 iterations takes 6463 microseconds
csum_partial: sz[1026] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1027] 10 iterations takes 6461 microseconds
csum_partial: sz[1027] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1028] 10 iterations takes 6361 microseconds
csum_partial: sz[1028] 1 iteration takes <636 microseconds>==<124 nanoseconds>
csum_partial: sz[1029] 10 iterations takes 6455 microseconds
csum_partial: sz[1029] 1 iteration takes <645 microseconds>==<125 nanoseconds>
csum_partial: sz[1030] 10 iterations takes 6438 microseconds
csum_partial: sz[1030] 1 iteration takes <643 microseconds>==<125 nanoseconds>
csum_partial: sz[1031] 10 iterations takes 6441 microseconds
csum_partial: sz[1031] 1 iteration takes <644 microseconds>==<125 nanoseconds>
csum_partial: sz[1032] 10 iterations takes 6451 microseconds
csum_partial: sz[1032] 1 iteration takes <645 microseconds>==<125 nanoseconds>
csum_partial: sz[1033] 10 iterations takes 6464 microseconds
csum_partial: sz[1033] 1 iteration takes <646 microseconds>==<126 nanoseconds>
2.3.3 csum_partial + my patch applyed:
measure_csum_partial
csum_partial: sz[0] 10 iterations takes 279 microseconds
csum_partial: sz[0] 1 iteration takes <27 microseconds>==<5 nanoseconds>
csum_partial: sz[1] 10 iterations takes 316 microseconds
csum_partial: sz[1] 1 iteration takes <31 microseconds>==<6 nanoseconds>
csum_partial: sz[2] 10 iterations takes 410 microseconds
csum_partial: sz[2] 1 iteration takes <41 microseconds>==<8 nanoseconds>
csum_partial: sz[3] 10 iterations takes 444 microseconds
csum_partial: sz[3] 1 iteration takes <44 microseconds>==<8 nanoseconds>
csum_partial: sz[4] 10 iterations takes 278 microseconds
csum_partial: sz[4] 1 iteration takes <27 microseconds>==<5 nanoseconds>
csum_partial: sz[5] 10 iterations takes 337 microseconds
csum_partial: sz[5] 1 iteration takes <33 microseconds>==<6 nanoseconds>
csum_partial: sz[6] 10 iterations takes 487 microseconds
csum_partial: sz[6] 1 iteration takes <48 microseconds>==<9 nanoseconds>
csum_partial: sz[7] 10 iterations takes 455 microseconds
csum_partial: sz[7] 1 iteration takes <45 microseconds>==<8 nanoseconds>
csum_partial: sz[8] 10 iterations takes 289 microseconds
csum_partial: sz[8] 1 iteration takes <28 microseconds>==<5 nanoseconds>
csum_partial: sz[9] 10 iterations takes 348 microseconds
csum_partial: sz[9] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[1024] 10 iterations takes 6447 microseconds
csum_partial: sz[1024] 1 iteration takes <644 microseconds>==<125 nanoseconds>
csum_partial: sz[1025] 10 iterations takes 6389 microseconds
csum_partial: sz[1025] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1026] 10 iterations takes 6548 microseconds
csum_partial: sz[1026] 1 iteration takes <654 microseconds>==<127 nanoseconds>
csum_partial: sz[1027] 10 iterations takes 6509 microseconds
csum_partial: sz[1027] 1 iteration takes <650 microseconds>==<126 nanoseconds>
csum_partial: sz[1028] 10 iterations takes 6383 microseconds
csum_partial: sz[1028] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1029] 10 iterations takes 6429 microseconds
csum_partial: sz[1029] 1 iteration takes <642 microseconds>==<125 nanoseconds>
csum_partial: sz[1030] 10 iterations takes 6521 microseconds
csum_partial: sz[1030] 1 iteration takes <652 microseconds>==<127 nanoseconds>
csum_partial: sz[1031] 10 iterations takes 6518 microseconds
csum_partial: sz[1031] 1 iteration takes <651 microseconds>==<127 nanoseconds>
csum_partial: sz[1032] 10 iterations takes 6419 microseconds
csum_partial: sz[1032] 1 iteration takes <641 microseconds>==<125 nanoseconds>
csum_partial: sz[1033] 10 iterations takes 6463 microseconds
csum_partial: sz[1033] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial_copy_generic from 2.3.3 stock:
measure_csum_partial_copy
csum_partial_copy: sz[0] 10 iterations takes 380 microseconds
csum_partial_copy: sz[0] 1 iteration takes <38 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[1] 10 iterations takes 479 microseconds
csum_partial_copy: sz[1] 1 iteration takes <47 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[2] 10 iterations takes 549 microseconds
csum_partial_copy: sz[2] 1 iteration takes <54 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[3] 10 iterations takes 615 microseconds
csum_partial_copy: sz[3] 1 iteration takes <61 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[4] 10 iterations takes 369 microseconds
csum_partial_copy: sz[4] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[5] 10 iterations takes 490 microseconds
csum_partial_copy: sz[5] 1 iteration takes <49 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[6] 10 iterations takes 489 microseconds
csum_partial_copy: sz[6] 1 iteration takes <48 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[7] 10 iterations takes 604 microseconds
csum_partial_copy: sz[7] 1 iteration takes <60 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[8] 10 iterations takes 398 microseconds
csum_partial_copy: sz[8] 1 iteration takes <39 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[9] 10 iterations takes 569 microseconds
csum_partial_copy: sz[9] 1 iteration takes <56 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[1024] 10 iterations takes 8445 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <844 microseconds>==<164 nanoseconds>
csum_partial_copy: sz[1025] 10 iterations takes 8522 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <852 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1026] 10 iterations takes 8521 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <852 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1027] 10 iterations takes 8597 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <859 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1028] 10 iterations takes 8460 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <846 microseconds>==<165 nanoseconds>
csum_partial_copy: sz[1029] 10 iterations takes 8552 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <855 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1030] 10 iterations takes 8565 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <856 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1031] 10 iterations takes 8644 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <864 microseconds>==<168 nanoseconds>
csum_partial_copy: sz[1032] 10 iterations takes 8473 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <847 microseconds>==<165 nanoseconds>
csum_partial_copy: sz[1033] 10 iterations takes 8564 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <856 microseconds>==<167 nanoseconds>
csum_partial_copy_generic + my patch applyed:
measure_csum_partial_copy
csum_partial_copy: sz[0] 10 iterations takes 362 microseconds
csum_partial_copy: sz[0] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[1] 10 iterations takes 501 microseconds
csum_partial_copy: sz[1] 1 iteration takes <50 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[2] 10 iterations takes 490 microseconds
csum_partial_copy: sz[2] 1 iteration takes <49 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[3] 10 iterations takes 581 microseconds
csum_partial_copy: sz[3] 1 iteration takes <58 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[4] 10 iterations takes 377 microseconds
csum_partial_copy: sz[4] 1 iteration takes <37 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[5] 10 iterations takes 524 microseconds
csum_partial_copy: sz[5] 1 iteration takes <52 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[6] 10 iterations takes 501 microseconds
csum_partial_copy: sz[6] 1 iteration takes <50 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[7] 10 iterations takes 625 microseconds
csum_partial_copy: sz[7] 1 iteration takes <62 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[8] 10 iterations takes 405 microseconds
csum_partial_copy: sz[8] 1 iteration takes <40 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[9] 10 iterations takes 559 microseconds
csum_partial_copy: sz[9] 1 iteration takes <55 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[1024] 10 iterations takes 6580 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <658 microseconds>==<128 nanoseconds>
csum_partial_copy: sz[1025] 10 iterations takes 6684 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <668 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1026] 10 iterations takes 6684 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <668 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1027] 10 iterations takes 6758 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <675 microseconds>==<131 nanoseconds>
csum_partial_copy: sz[1028] 10 iterations takes 6650 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <665 microseconds>==<129 nanoseconds>
csum_partial_copy: sz[1029] 10 iterations takes 6702 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <670 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1030] 10 iterations takes 6703 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <670 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1031] 10 iterations takes 6780 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <678 microseconds>==<132 nanoseconds>
csum_partial_copy: sz[1032] 10 iterations takes 6613 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <661 microseconds>==<129 nanoseconds>
csum_partial_copy: sz[1033] 10 iterations takes 6714 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <671 microseconds>==<131 nanoseconds>
Patch against 2.3.3:
Index: linux/arch/i386/lib/checksum.S
===================================================================
RCS file: /var/cvs/linux/arch/i386/lib/checksum.S,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 checksum.S
--- linux/arch/i386/lib/checksum.S 1999/05/18 20:17:13 1.1.1.3
+++ linux/arch/i386/lib/checksum.S 1999/05/24 01:40:00
@@ -18,6 +18,8 @@
* handling.
* Andi Kleen, add zeroing on error
* converted to pure assembler
+ * Andrea Arcangeli, fixed a potential buffer overflow in the
+ * 686 csum_partial, and some improvement in the 686 code.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -123,9 +125,6 @@
movl 16(%esp),%ecx # Function arg: int len
movl 12(%esp),%esi # Function arg: const unsigned char *buf
- testl $2, %esi
- jnz 30f
-10:
movl %ecx, %edx
movl %ecx, %ebx
andl $0x7c, %ebx
@@ -134,27 +133,9 @@
shrl $2, %ebx
negl %ebx
lea 45f(%ebx,%ebx,2), %ebx
- testl %esi, %esi
+ clc
jmp *%ebx
- # Handle 2-byte-aligned regions
-20: addw (%esi), %ax
- lea 2(%esi), %esi
- adcl $0, %eax
- jmp 10b
-
-30: subl $2, %ecx
- ja 20b
- je 32f
- movzbl (%esi),%ebx # csumming 1 byte, 2-aligned
- addl %ebx, %eax
- adcl $0, %eax
- jmp 80f
-32:
- addw (%esi), %ax # csumming 2 bytes, 2-aligned
- adcl $0, %eax
- jmp 80f
-
40:
addl -128(%esi), %eax
adcl -124(%esi), %eax
@@ -193,18 +174,20 @@
adcl $0, %eax
dec %ecx
jge 40b
- movl %edx, %ecx
-50: andl $3, %ecx
+50:
+ testl $3, %edx
jz 80f
-
- # Handle the last 1-3 bytes without jumping
- notl %ecx # 1->2, 2->1, 3->0, higher bits are masked
- movl $0xffffff,%ebx # by the shll and shrl instructions
- shll $3,%ecx
- shrl %cl,%ebx
- andl -128(%esi),%ebx # esi is 4-aligned so should be ok
- addl %ebx,%eax
+ testl $2, %edx
+ jz 70f
+ addw -128(%esi), %ax
+ lea 2(%esi), %esi
adcl $0,%eax
+70:
+ testl $1, %edx
+ jz 80f
+ movzbl -128(%esi), %ebx
+ addl %ebx, %eax
+ adcl $0, %eax
80:
popl %ebx
popl %esi
@@ -393,36 +376,44 @@
movl ARGBASE+16(%esp),%eax #sum
movl %ecx, %edx
movl %ecx, %ebx
- shrl $6, %ecx
- andl $0x3c, %ebx
+ andl $0x7c, %ebx
+ shrl $7, %ecx
+ addl %ebx,%esi
+ addl %ebx,%edi
+ shrl $2, %ebx
negl %ebx
- subl %ebx, %esi
- subl %ebx, %edi
- lea 3f(%ebx,%ebx), %ebx
- testl %esi, %esi
+ shll $3, %ebx
+ leal 3f(%ebx), %ebx
+ clc
jmp *%ebx
-1: addl $64,%esi
- addl $64,%edi
- ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
+1:
+ ROUND1(-128) ROUND(-124) ROUND(-120) ROUND(-116)
+ ROUND (-112) ROUND(-108) ROUND(-104) ROUND(-100)
+ ROUND (-96) ROUND(-92) ROUND(-88) ROUND(-84)
+ ROUND (-80) ROUND(-76) ROUND(-72) ROUND(-68)
+ ROUND (-64) ROUND(-60) ROUND(-56) ROUND(-52)
ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
-3: adcl $0,%eax
+3:
+ leal 128(%esi), %esi
+ adcl $0,%eax
+ leal 128(%edi), %edi
dec %ecx
jge 1b
4: andl $3, %edx
jz 7f
cmpl $2, %edx
jb 5f
-SRC( movw (%esi), %dx )
+SRC( movw -128(%esi), %dx )
leal 2(%esi), %esi
-DST( movw %dx, (%edi) )
+DST( movw %dx, -128(%edi) )
leal 2(%edi), %edi
je 6f
shll $16,%edx
5:
-SRC( movb (%esi), %dl )
-DST( movb %dl, (%edi) )
+SRC( movb -128(%esi), %dl )
+DST( movb %dl, -128(%edi) )
6: addl %edx, %eax
adcl $0, %eax
7:
Andrea Arcangeli
PS. Yesterday I played a bit with mmx (ignoring the fsave cost that I
think could be reduced near zero by using the TS bit of the cr0) and it
seems impossible to me to get better peformances than the current no-mmx
cksum. The real problem is that MMX has no carry-capable instructions. I
have not tried with SIMD though (mainly because my cpus are not SIMD
capable).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/