[patch] checksum P6 asm buffer overflow fix + 686 improvements

Andrea Arcangeli (andrea@suse.de)
Mon, 24 May 1999 03:47:26 +0200 (CEST)


In the 686 asm of csum_partial there is an andl done on a memory address
that as best will contains only 3 valid bytes. As far as the four byte is
in the 4mbyte pagetable with the kernel there is no problem (since the
four byte is then ignored by masking) but in the unlikely case that the
first one/two/three bytes are the last bytes of a vmalloced area, then
accessing the second/third/four byte will generate an Oops.

I also improved a bit the asm (mainly the csum_copy that wasn't
unrolling enough according to the new numbers).

sz[0] means that the csum is been run on a buffer with size == 0 byte.

2.3.3 csum_partial:

measure_csum_partial
csum_partial: sz[0] 10 iterations takes 291 microseconds
csum_partial: sz[0] 1 iteration takes <29 microseconds>==<5 nanoseconds>
csum_partial: sz[1] 10 iterations takes 344 microseconds
csum_partial: sz[1] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[2] 10 iterations takes 344 microseconds
csum_partial: sz[2] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[3] 10 iterations takes 344 microseconds
csum_partial: sz[3] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[4] 10 iterations takes 311 microseconds
csum_partial: sz[4] 1 iteration takes <31 microseconds>==<6 nanoseconds>
csum_partial: sz[5] 10 iterations takes 363 microseconds
csum_partial: sz[5] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[6] 10 iterations takes 363 microseconds
csum_partial: sz[6] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[7] 10 iterations takes 362 microseconds
csum_partial: sz[7] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial: sz[8] 10 iterations takes 327 microseconds
csum_partial: sz[8] 1 iteration takes <32 microseconds>==<6 nanoseconds>
csum_partial: sz[9] 10 iterations takes 382 microseconds
csum_partial: sz[9] 1 iteration takes <38 microseconds>==<7 nanoseconds>
csum_partial: sz[1024] 10 iterations takes 6427 microseconds
csum_partial: sz[1024] 1 iteration takes <642 microseconds>==<125 nanoseconds>
csum_partial: sz[1025] 10 iterations takes 6465 microseconds
csum_partial: sz[1025] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1026] 10 iterations takes 6463 microseconds
csum_partial: sz[1026] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1027] 10 iterations takes 6461 microseconds
csum_partial: sz[1027] 1 iteration takes <646 microseconds>==<126 nanoseconds>
csum_partial: sz[1028] 10 iterations takes 6361 microseconds
csum_partial: sz[1028] 1 iteration takes <636 microseconds>==<124 nanoseconds>
csum_partial: sz[1029] 10 iterations takes 6455 microseconds
csum_partial: sz[1029] 1 iteration takes <645 microseconds>==<125 nanoseconds>
csum_partial: sz[1030] 10 iterations takes 6438 microseconds
csum_partial: sz[1030] 1 iteration takes <643 microseconds>==<125 nanoseconds>
csum_partial: sz[1031] 10 iterations takes 6441 microseconds
csum_partial: sz[1031] 1 iteration takes <644 microseconds>==<125 nanoseconds>
csum_partial: sz[1032] 10 iterations takes 6451 microseconds
csum_partial: sz[1032] 1 iteration takes <645 microseconds>==<125 nanoseconds>
csum_partial: sz[1033] 10 iterations takes 6464 microseconds
csum_partial: sz[1033] 1 iteration takes <646 microseconds>==<126 nanoseconds>

2.3.3 csum_partial + my patch applyed:

measure_csum_partial
csum_partial: sz[0] 10 iterations takes 279 microseconds
csum_partial: sz[0] 1 iteration takes <27 microseconds>==<5 nanoseconds>
csum_partial: sz[1] 10 iterations takes 316 microseconds
csum_partial: sz[1] 1 iteration takes <31 microseconds>==<6 nanoseconds>
csum_partial: sz[2] 10 iterations takes 410 microseconds
csum_partial: sz[2] 1 iteration takes <41 microseconds>==<8 nanoseconds>
csum_partial: sz[3] 10 iterations takes 444 microseconds
csum_partial: sz[3] 1 iteration takes <44 microseconds>==<8 nanoseconds>
csum_partial: sz[4] 10 iterations takes 278 microseconds
csum_partial: sz[4] 1 iteration takes <27 microseconds>==<5 nanoseconds>
csum_partial: sz[5] 10 iterations takes 337 microseconds
csum_partial: sz[5] 1 iteration takes <33 microseconds>==<6 nanoseconds>
csum_partial: sz[6] 10 iterations takes 487 microseconds
csum_partial: sz[6] 1 iteration takes <48 microseconds>==<9 nanoseconds>
csum_partial: sz[7] 10 iterations takes 455 microseconds
csum_partial: sz[7] 1 iteration takes <45 microseconds>==<8 nanoseconds>
csum_partial: sz[8] 10 iterations takes 289 microseconds
csum_partial: sz[8] 1 iteration takes <28 microseconds>==<5 nanoseconds>
csum_partial: sz[9] 10 iterations takes 348 microseconds
csum_partial: sz[9] 1 iteration takes <34 microseconds>==<6 nanoseconds>
csum_partial: sz[1024] 10 iterations takes 6447 microseconds
csum_partial: sz[1024] 1 iteration takes <644 microseconds>==<125 nanoseconds>
csum_partial: sz[1025] 10 iterations takes 6389 microseconds
csum_partial: sz[1025] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1026] 10 iterations takes 6548 microseconds
csum_partial: sz[1026] 1 iteration takes <654 microseconds>==<127 nanoseconds>
csum_partial: sz[1027] 10 iterations takes 6509 microseconds
csum_partial: sz[1027] 1 iteration takes <650 microseconds>==<126 nanoseconds>
csum_partial: sz[1028] 10 iterations takes 6383 microseconds
csum_partial: sz[1028] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1029] 10 iterations takes 6429 microseconds
csum_partial: sz[1029] 1 iteration takes <642 microseconds>==<125 nanoseconds>
csum_partial: sz[1030] 10 iterations takes 6521 microseconds
csum_partial: sz[1030] 1 iteration takes <652 microseconds>==<127 nanoseconds>
csum_partial: sz[1031] 10 iterations takes 6518 microseconds
csum_partial: sz[1031] 1 iteration takes <651 microseconds>==<127 nanoseconds>
csum_partial: sz[1032] 10 iterations takes 6419 microseconds
csum_partial: sz[1032] 1 iteration takes <641 microseconds>==<125 nanoseconds>
csum_partial: sz[1033] 10 iterations takes 6463 microseconds
csum_partial: sz[1033] 1 iteration takes <646 microseconds>==<126 nanoseconds>

csum_partial_copy_generic from 2.3.3 stock:

measure_csum_partial_copy
csum_partial_copy: sz[0] 10 iterations takes 380 microseconds
csum_partial_copy: sz[0] 1 iteration takes <38 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[1] 10 iterations takes 479 microseconds
csum_partial_copy: sz[1] 1 iteration takes <47 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[2] 10 iterations takes 549 microseconds
csum_partial_copy: sz[2] 1 iteration takes <54 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[3] 10 iterations takes 615 microseconds
csum_partial_copy: sz[3] 1 iteration takes <61 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[4] 10 iterations takes 369 microseconds
csum_partial_copy: sz[4] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[5] 10 iterations takes 490 microseconds
csum_partial_copy: sz[5] 1 iteration takes <49 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[6] 10 iterations takes 489 microseconds
csum_partial_copy: sz[6] 1 iteration takes <48 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[7] 10 iterations takes 604 microseconds
csum_partial_copy: sz[7] 1 iteration takes <60 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[8] 10 iterations takes 398 microseconds
csum_partial_copy: sz[8] 1 iteration takes <39 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[9] 10 iterations takes 569 microseconds
csum_partial_copy: sz[9] 1 iteration takes <56 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[1024] 10 iterations takes 8445 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <844 microseconds>==<164 nanoseconds>
csum_partial_copy: sz[1025] 10 iterations takes 8522 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <852 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1026] 10 iterations takes 8521 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <852 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1027] 10 iterations takes 8597 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <859 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1028] 10 iterations takes 8460 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <846 microseconds>==<165 nanoseconds>
csum_partial_copy: sz[1029] 10 iterations takes 8552 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <855 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1030] 10 iterations takes 8565 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <856 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1031] 10 iterations takes 8644 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <864 microseconds>==<168 nanoseconds>
csum_partial_copy: sz[1032] 10 iterations takes 8473 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <847 microseconds>==<165 nanoseconds>
csum_partial_copy: sz[1033] 10 iterations takes 8564 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <856 microseconds>==<167 nanoseconds>

csum_partial_copy_generic + my patch applyed:

measure_csum_partial_copy
csum_partial_copy: sz[0] 10 iterations takes 362 microseconds
csum_partial_copy: sz[0] 1 iteration takes <36 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[1] 10 iterations takes 501 microseconds
csum_partial_copy: sz[1] 1 iteration takes <50 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[2] 10 iterations takes 490 microseconds
csum_partial_copy: sz[2] 1 iteration takes <49 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[3] 10 iterations takes 581 microseconds
csum_partial_copy: sz[3] 1 iteration takes <58 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[4] 10 iterations takes 377 microseconds
csum_partial_copy: sz[4] 1 iteration takes <37 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[5] 10 iterations takes 524 microseconds
csum_partial_copy: sz[5] 1 iteration takes <52 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[6] 10 iterations takes 501 microseconds
csum_partial_copy: sz[6] 1 iteration takes <50 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[7] 10 iterations takes 625 microseconds
csum_partial_copy: sz[7] 1 iteration takes <62 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[8] 10 iterations takes 405 microseconds
csum_partial_copy: sz[8] 1 iteration takes <40 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[9] 10 iterations takes 559 microseconds
csum_partial_copy: sz[9] 1 iteration takes <55 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[1024] 10 iterations takes 6580 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <658 microseconds>==<128 nanoseconds>
csum_partial_copy: sz[1025] 10 iterations takes 6684 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <668 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1026] 10 iterations takes 6684 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <668 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1027] 10 iterations takes 6758 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <675 microseconds>==<131 nanoseconds>
csum_partial_copy: sz[1028] 10 iterations takes 6650 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <665 microseconds>==<129 nanoseconds>
csum_partial_copy: sz[1029] 10 iterations takes 6702 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <670 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1030] 10 iterations takes 6703 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <670 microseconds>==<130 nanoseconds>
csum_partial_copy: sz[1031] 10 iterations takes 6780 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <678 microseconds>==<132 nanoseconds>
csum_partial_copy: sz[1032] 10 iterations takes 6613 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <661 microseconds>==<129 nanoseconds>
csum_partial_copy: sz[1033] 10 iterations takes 6714 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <671 microseconds>==<131 nanoseconds>

Patch against 2.3.3:

Index: linux/arch/i386/lib/checksum.S
===================================================================
RCS file: /var/cvs/linux/arch/i386/lib/checksum.S,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 checksum.S
--- linux/arch/i386/lib/checksum.S 1999/05/18 20:17:13 1.1.1.3
+++ linux/arch/i386/lib/checksum.S 1999/05/24 01:40:00
@@ -18,6 +18,8 @@
* handling.
* Andi Kleen, add zeroing on error
* converted to pure assembler
+ * Andrea Arcangeli, fixed a potential buffer overflow in the
+ * 686 csum_partial, and some improvement in the 686 code.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -123,9 +125,6 @@
movl 16(%esp),%ecx # Function arg: int len
movl 12(%esp),%esi # Function arg: const unsigned char *buf

- testl $2, %esi
- jnz 30f
-10:
movl %ecx, %edx
movl %ecx, %ebx
andl $0x7c, %ebx
@@ -134,27 +133,9 @@
shrl $2, %ebx
negl %ebx
lea 45f(%ebx,%ebx,2), %ebx
- testl %esi, %esi
+ clc
jmp *%ebx

- # Handle 2-byte-aligned regions
-20: addw (%esi), %ax
- lea 2(%esi), %esi
- adcl $0, %eax
- jmp 10b
-
-30: subl $2, %ecx
- ja 20b
- je 32f
- movzbl (%esi),%ebx # csumming 1 byte, 2-aligned
- addl %ebx, %eax
- adcl $0, %eax
- jmp 80f
-32:
- addw (%esi), %ax # csumming 2 bytes, 2-aligned
- adcl $0, %eax
- jmp 80f
-
40:
addl -128(%esi), %eax
adcl -124(%esi), %eax
@@ -193,18 +174,20 @@
adcl $0, %eax
dec %ecx
jge 40b
- movl %edx, %ecx
-50: andl $3, %ecx
+50:
+ testl $3, %edx
jz 80f
-
- # Handle the last 1-3 bytes without jumping
- notl %ecx # 1->2, 2->1, 3->0, higher bits are masked
- movl $0xffffff,%ebx # by the shll and shrl instructions
- shll $3,%ecx
- shrl %cl,%ebx
- andl -128(%esi),%ebx # esi is 4-aligned so should be ok
- addl %ebx,%eax
+ testl $2, %edx
+ jz 70f
+ addw -128(%esi), %ax
+ lea 2(%esi), %esi
adcl $0,%eax
+70:
+ testl $1, %edx
+ jz 80f
+ movzbl -128(%esi), %ebx
+ addl %ebx, %eax
+ adcl $0, %eax
80:
popl %ebx
popl %esi
@@ -393,36 +376,44 @@
movl ARGBASE+16(%esp),%eax #sum
movl %ecx, %edx
movl %ecx, %ebx
- shrl $6, %ecx
- andl $0x3c, %ebx
+ andl $0x7c, %ebx
+ shrl $7, %ecx
+ addl %ebx,%esi
+ addl %ebx,%edi
+ shrl $2, %ebx
negl %ebx
- subl %ebx, %esi
- subl %ebx, %edi
- lea 3f(%ebx,%ebx), %ebx
- testl %esi, %esi
+ shll $3, %ebx
+ leal 3f(%ebx), %ebx
+ clc
jmp *%ebx
-1: addl $64,%esi
- addl $64,%edi
- ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
+1:
+ ROUND1(-128) ROUND(-124) ROUND(-120) ROUND(-116)
+ ROUND (-112) ROUND(-108) ROUND(-104) ROUND(-100)
+ ROUND (-96) ROUND(-92) ROUND(-88) ROUND(-84)
+ ROUND (-80) ROUND(-76) ROUND(-72) ROUND(-68)
+ ROUND (-64) ROUND(-60) ROUND(-56) ROUND(-52)
ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
-3: adcl $0,%eax
+3:
+ leal 128(%esi), %esi
+ adcl $0,%eax
+ leal 128(%edi), %edi
dec %ecx
jge 1b
4: andl $3, %edx
jz 7f
cmpl $2, %edx
jb 5f
-SRC( movw (%esi), %dx )
+SRC( movw -128(%esi), %dx )
leal 2(%esi), %esi
-DST( movw %dx, (%edi) )
+DST( movw %dx, -128(%edi) )
leal 2(%edi), %edi
je 6f
shll $16,%edx
5:
-SRC( movb (%esi), %dl )
-DST( movb %dl, (%edi) )
+SRC( movb -128(%esi), %dl )
+DST( movb %dl, -128(%edi) )
6: addl %edx, %eax
adcl $0, %eax
7:

Andrea Arcangeli

PS. Yesterday I played a bit with mmx (ignoring the fsave cost that I
think could be reduced near zero by using the TS bit of the cr0) and it
seems impossible to me to get better peformances than the current no-mmx
cksum. The real problem is that MMX has no carry-capable instructions. I
have not tried with SIMD though (mainly because my cpus are not SIMD
capable).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/