Re: [patch] checksum P6 asm buffer overflow fix + 686 improvements

Andrea Arcangeli (andrea@suse.de)
Mon, 24 May 1999 15:28:31 +0200 (CEST)


On Mon, 24 May 1999, Ingo Molnar wrote:

>cached (like in your benchmark). In cold-cache situations it's invisible.

clean 2.3.3:

measure_csum_partial_copy
csum_partial_copy: sz[0] 1 iterations takes 42 microseconds
csum_partial_copy: sz[0] 1 iteration takes <42 microseconds>==<8 nanoseconds>
csum_partial_copy: sz[1] 1 iterations takes 51 microseconds
csum_partial_copy: sz[1] 1 iteration takes <51 microseconds>==<9 nanoseconds>
csum_partial_copy: sz[2] 1 iterations takes 52 microseconds
csum_partial_copy: sz[2] 1 iteration takes <52 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[3] 1 iterations takes 63 microseconds
csum_partial_copy: sz[3] 1 iteration takes <63 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[4] 1 iterations takes 39 microseconds
csum_partial_copy: sz[4] 1 iteration takes <39 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[5] 1 iterations takes 52 microseconds
csum_partial_copy: sz[5] 1 iteration takes <52 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[6] 1 iterations takes 52 microseconds
csum_partial_copy: sz[6] 1 iteration takes <52 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[7] 1 iterations takes 63 microseconds
csum_partial_copy: sz[7] 1 iteration takes <63 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[8] 1 iterations takes 44 microseconds
csum_partial_copy: sz[8] 1 iteration takes <44 microseconds>==<8 nanoseconds>
csum_partial_copy: sz[9] 1 iterations takes 57 microseconds
csum_partial_copy: sz[9] 1 iteration takes <57 microseconds>==<11 nanoseconds>
csum_partial_copy: sz[1024] 1 iterations takes 854 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <854 microseconds>==<166 nanoseconds>
csum_partial_copy: sz[1025] 1 iterations takes 869 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <869 microseconds>==<169 nanoseconds>
csum_partial_copy: sz[1026] 1 iterations takes 862 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <862 microseconds>==<168 nanoseconds>
csum_partial_copy: sz[1027] 1 iterations takes 896 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <896 microseconds>==<175 nanoseconds>
csum_partial_copy: sz[1028] 1 iterations takes 857 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <857 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1029] 1 iterations takes 866 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <866 microseconds>==<169 nanoseconds>
csum_partial_copy: sz[1030] 1 iterations takes 866 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <866 microseconds>==<169 nanoseconds>
csum_partial_copy: sz[1031] 1 iterations takes 876 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <876 microseconds>==<171 nanoseconds>
csum_partial_copy: sz[1032] 1 iterations takes 859 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <859 microseconds>==<167 nanoseconds>
csum_partial_copy: sz[1033] 1 iterations takes 868 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <868 microseconds>==<169 nanoseconds>

2.3.3 + my patch:

measure_csum_partial_copy
csum_partial_copy: sz[0] 1 iterations takes 40 microseconds
csum_partial_copy: sz[0] 1 iteration takes <40 microseconds>==<7 nanoseconds>
csum_partial_copy: sz[1] 1 iterations takes 54 microseconds
csum_partial_copy: sz[1] 1 iteration takes <54 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[2] 1 iterations takes 53 microseconds
csum_partial_copy: sz[2] 1 iteration takes <53 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[3] 1 iterations takes 64 microseconds
csum_partial_copy: sz[3] 1 iteration takes <64 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[4] 1 iterations takes 42 microseconds
csum_partial_copy: sz[4] 1 iteration takes <42 microseconds>==<8 nanoseconds>
csum_partial_copy: sz[5] 1 iterations takes 55 microseconds
csum_partial_copy: sz[5] 1 iteration takes <55 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[6] 1 iterations takes 55 microseconds
csum_partial_copy: sz[6] 1 iteration takes <55 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[7] 1 iterations takes 65 microseconds
csum_partial_copy: sz[7] 1 iteration takes <65 microseconds>==<12 nanoseconds>
csum_partial_copy: sz[8] 1 iterations takes 44 microseconds
csum_partial_copy: sz[8] 1 iteration takes <44 microseconds>==<8 nanoseconds>
csum_partial_copy: sz[9] 1 iterations takes 55 microseconds
csum_partial_copy: sz[9] 1 iteration takes <55 microseconds>==<10 nanoseconds>
csum_partial_copy: sz[1024] 1 iterations takes 822 microseconds
csum_partial_copy: sz[1024] 1 iteration takes <822 microseconds>==<160 nanoseconds>
csum_partial_copy: sz[1025] 1 iterations takes 832 microseconds
csum_partial_copy: sz[1025] 1 iteration takes <832 microseconds>==<162 nanoseconds>
csum_partial_copy: sz[1026] 1 iterations takes 831 microseconds
csum_partial_copy: sz[1026] 1 iteration takes <831 microseconds>==<162 nanoseconds>
csum_partial_copy: sz[1027] 1 iterations takes 841 microseconds
csum_partial_copy: sz[1027] 1 iteration takes <841 microseconds>==<164 nanoseconds>
csum_partial_copy: sz[1028] 1 iterations takes 825 microseconds
csum_partial_copy: sz[1028] 1 iteration takes <825 microseconds>==<161 nanoseconds>
csum_partial_copy: sz[1029] 1 iterations takes 833 microseconds
csum_partial_copy: sz[1029] 1 iteration takes <833 microseconds>==<162 nanoseconds>
csum_partial_copy: sz[1030] 1 iterations takes 834 microseconds
csum_partial_copy: sz[1030] 1 iteration takes <834 microseconds>==<162 nanoseconds>
csum_partial_copy: sz[1031] 1 iterations takes 842 microseconds
csum_partial_copy: sz[1031] 1 iteration takes <842 microseconds>==<164 nanoseconds>
csum_partial_copy: sz[1032] 1 iterations takes 835 microseconds
csum_partial_copy: sz[1032] 1 iteration takes <835 microseconds>==<163 nanoseconds>
csum_partial_copy: sz[1033] 1 iterations takes 837 microseconds
csum_partial_copy: sz[1033] 1 iteration takes <837 microseconds>==<163 nanoseconds>

It seems it is still worth unrolling to a 128 offset even when the data is
not cached. I ran all the tests in cache just to measure the real cost of
the CPU and not of memory speed. And the data may well be in cache: userspace
may have written it shortly before calling write(). Cache sizes could also
grow with newer CPUs. I also prefer to optimize for long checksums, since
the bottleneck is when we have to checksum tons of data (for interactive
network connections cksum_copy barely shows up in the profile). After
a `netcat localhost discard </dev/null' look at the profiling:

andrea@laser:~$ readprofile -m /System.map |sort -nr | head -2
4186 total 0.0061
664 csum_partial_copy_generic 1.7849

If the packets are small then it's not critical to have a fast cksum
routine.

>doing fast MMX TCP checksums is possible, even if the MMX engine doesnt
>have a carry logic, this is from a csum routine i wrote a year ago:
>
> movq %%mm1, %%mm3;
> paddd (%%esi),%%mm1;
> pcmpgtd %%mm1, %%mm3;
> psubd %%mm3, %%mm1;

Yes, I tried the above but it was a loss. Maybe it was my overly
unrolled loop that hurt, though...
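
For readers following the quoted MMX sequence, here is a minimal scalar sketch of the same idea in C: add a word, detect the wrap by comparing the old sum with the new one, and add the carry back in by subtracting an all-ones mask. The function names are hypothetical and this is neither the quoted routine nor the kernel code. It also uses an unsigned comparison, whereas pcmpgtd compares signed dwords, so real MMX code has to account for that difference, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add one 32-bit word to a running sum, wrapping any carry back in.
 * The steps loosely follow the quoted sequence (copy, add, compare,
 * subtract mask), but with an unsigned compare. */
static uint32_t add_end_around(uint32_t sum, uint32_t word)
{
    uint32_t old = sum;                            /* keep the old sum */
    sum += word;                                   /* wrapping 32-bit add */
    uint32_t mask = (old > sum) ? UINT32_MAX : 0;  /* all ones if it wrapped */
    return sum - mask;                             /* minus (-1) adds the carry */
}

/* Hypothetical helper: sum an array of 32-bit words and fold the
 * result to 16 bits. Returns the folded sum, not its complement. */
static uint16_t csum_words(const uint32_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = add_end_around(sum, words[i]);
    sum = (sum & 0xFFFF) + (sum >> 16);            /* fold high half into low */
    sum = (sum & 0xFFFF) + (sum >> 16);            /* absorb the fold's carry */
    return (uint16_t)sum;
}
```

After a wrap the sum is at most 0xFFFFFFFE, so adding the carry back cannot wrap a second time.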

>demonstrates the method nicely), but i finally found that the FPU handling
>complexity is simply not worth it. More and more networking cards are
>doing IP checksumming anyway.

Here are the numbers for your old _plain_ MMX code with the FPU cost
_excluded_:

2.3.3 + my patch applied:

measure_csum_partial
csum_partial: sz[1024] 100 iterations takes 63809 microseconds
csum_partial: sz[1024] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1025] 100 iterations takes 63707 microseconds
csum_partial: sz[1025] 1 iteration takes <637 microseconds>==<124 nanoseconds>
csum_partial: sz[1026] 100 iterations takes 65380 microseconds
csum_partial: sz[1026] 1 iteration takes <653 microseconds>==<127 nanoseconds>
csum_partial: sz[1027] 100 iterations takes 65076 microseconds
csum_partial: sz[1027] 1 iteration takes <650 microseconds>==<126 nanoseconds>
csum_partial: sz[1028] 100 iterations takes 63870 microseconds
csum_partial: sz[1028] 1 iteration takes <638 microseconds>==<124 nanoseconds>
csum_partial: sz[1029] 100 iterations takes 64300 microseconds
csum_partial: sz[1029] 1 iteration takes <643 microseconds>==<125 nanoseconds>
csum_partial: sz[1030] 100 iterations takes 65221 microseconds
csum_partial: sz[1030] 1 iteration takes <652 microseconds>==<127 nanoseconds>
csum_partial: sz[1031] 100 iterations takes 65229 microseconds
csum_partial: sz[1031] 1 iteration takes <652 microseconds>==<127 nanoseconds>
csum_partial: sz[1032] 100 iterations takes 64170 microseconds
csum_partial: sz[1032] 1 iteration takes <641 microseconds>==<125 nanoseconds>
csum_partial: sz[1033] 100 iterations takes 64678 microseconds
csum_partial: sz[1033] 1 iteration takes <646 microseconds>==<126 nanoseconds>

your old mmx checksum code:

measure_csum_partial
csum_partial: sz[1024] 100 iterations takes 59144 microseconds
csum_partial: sz[1024] 1 iteration takes <591 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1025] 100 iterations takes 60409 microseconds
csum_partial: sz[1025] 1 iteration takes <604 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1026] 100 iterations takes 60463 microseconds
csum_partial: sz[1026] 1 iteration takes <604 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1027] 100 iterations takes 61917 microseconds
csum_partial: sz[1027] 1 iteration takes <619 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1028] 100 iterations takes 60216 microseconds
csum_partial: sz[1028] 1 iteration takes <602 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1029] 100 iterations takes 62149 microseconds
csum_partial: sz[1029] 1 iteration takes <621 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1030] 100 iterations takes 61098 microseconds
csum_partial: sz[1030] 1 iteration takes <610 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1031] 100 iterations takes 62289 microseconds
csum_partial: sz[1031] 1 iteration takes <622 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1032] 100 iterations takes 61329 microseconds
csum_partial: sz[1032] 1 iteration takes <613 microseconds>==<-2147483648 nanoseconds>
csum_partial: sz[1033] 100 iterations takes 62497 microseconds
csum_partial: sz[1033] 1 iteration takes <624 microseconds>==<-2147483648 nanoseconds>

Yes, it improved performance, but I am not sure it is worth it since I
would still have to add the fpu_save trick.
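
To put the two csum_partial runs side by side, a small sketch of the arithmetic follows. The helper name is hypothetical; the inputs are the 100-iteration totals printed in the logs above.

```c
#include <assert.h>

/* Relative time saved, in percent, going from before_us to after_us. */
static double percent_saved(double before_us, double after_us)
{
    assert(before_us > 0.0);
    return 100.0 * (before_us - after_us) / before_us;
}
```

For sz[1024], percent_saved(63809, 59144) comes out at roughly 7.3%, and for sz[1033], percent_saved(64678, 62497) is about 3.4%, both with the FPU save/restore cost excluded.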

Adding the fpu_save trick should be quite easy; I think we could just do
what we do in schedule:

unlazy_fpu(current);
mmx_checksum(); /* play with FPU as you like without saving */

Then when the process goes back to using the FPU it will generate a device
not available exception that will restore its old FPU state...
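
The sequence described above can be sketched as a userspace toy model in C. This is an illustration only: every name here is hypothetical, none of it is kernel API, and plain variables stand in for the FPU registers and the trap flag. A real implementation would also have to handle the trap flag around the kernel's own MMX use, which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_task {
    int  saved_fpu;   /* FPU state saved in the task structure */
    bool owns_fpu;    /* true while the task's state is live in hw_fpu */
};

static int  hw_fpu;      /* stands in for the FPU/MMX registers */
static bool ts_flag;     /* when set, the task's next FPU use traps */
static int  trap_count;  /* number of simulated traps taken */

/* Save the live state once and arm the trap. */
static void toy_unlazy_fpu(struct toy_task *t)
{
    assert(t != NULL);
    if (t->owns_fpu) {
        t->saved_fpu = hw_fpu;
        t->owns_fpu = false;
        ts_flag = true;
    }
}

/* Kernel-side MMX user: clobbers the registers without saving them. */
static void toy_mmx_checksum(void)
{
    hw_fpu = -1;
}

/* The task touches the FPU: if the trap is armed, count it and restore
 * the saved state first, then return the register contents. */
static int toy_task_use_fpu(struct toy_task *t)
{
    assert(t != NULL);
    if (ts_flag || !t->owns_fpu) {
        trap_count++;
        ts_flag = false;
        hw_fpu = t->saved_fpu;
        t->owns_fpu = true;
    }
    return hw_fpu;
}
```

In this model each checksum call costs the FPU-using task one save plus one trap on its next FPU access, and a task that never touches the FPU pays nothing.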

But with the MMX done as above, anybody using opengl + network heavily
will go slower than with the current cksum 686 code (also considering the
many exceptions that the unlazy trick will generate). A router alone will
probably run faster, but the improvement is not large enough to convince
me to use MMX...

Andrea Arcangeli
