Re: IP Checksumming

Richard B. Johnson (root@analogic.com)
Thu, 21 Nov 1996 09:09:15 -0500 (EST)


On 20 Nov 1996, Tom May wrote:

>
> "Richard B. Johnson" <root@analogic.com> writes:
>
> >
> > It minimizes the number of jumps which flush prefetch buffers and
> > messes up the speed.
>
> What are you talking about? Your code does one jump per word. The
> existing code does one jump per 16 words.

If the loop is within the cache-line there is no penalty. The timing
is as specified in the manual. By the time the loop instruction is
executed the second time all code will be within the cache even if
there was as penality on the first execution.

>
> The tricks are loop unrolling and minimizing Pentium pipeline stalls.
> They work.
>
> > However, every time a branch
> > on compare occurs, there is an enormous penality because the prefetch
> > buffer must be flushed.
>
> Not true with Pentium branch prediction. I know how the Pentium
> algorithm works and it will predict our branches correctly in almost
> all cases. The bad thing about the branch instruction on a Pentium is
> that it must be executed at all.
>
> > Further, the present routines in ../../asm
> > don't take advantage of the Intel architecture.
>
> You mean the lodsw and loop instructions? Those have been losers
> since the 386.

Not true. These built-in macros are responsible for much of the performance
improvements over chips such as the 68k providing the developer took the
time to use them.

>
> Just to prove that the existing stuff is pretty damn good (although
> not optimal, I've submitted a patch that hasn't been applied yet and
> I've recently got some more stuff to look at), I converted your code
> and timed it against the kernel code on a 486. These are *times*, so
> smaller is better:
>
> Your code: 3259
> Kernel code: 328
>
> Your code is an order of magnitude slower. Expect results on a
> Pentium to be more spectacularly different.

Since you have taken the time to convert the routine, and you have
a pentium, you are quite obviously interested in making the checksum
routines as good as possible. Therefore, I suggest the following
which can only be tested on a pentium.

The pentium has an internal cycle-counter. It is accessed by the
following two-byte instruction : 0x0f , 0x31. This loads the existing
internal cycles that have existed since reset into edx:eax (a quadword).

If you grab this value before ANY routine you want to test, and grab
the value after the routine, you can time the actual number of cycles
that it took to execute the routine. Warning, you must clear interrupts
before, or you will be timing (unknown) interrupt service routines as
well as your code.

I think there is a problem with testing the speed of the checksum
routines because changes in execution speed are way down in the
noise level when you test "networking" results.

>
> Here's your converted routine. You're welcome to put it in your own
> test harness and time it against the kernel code yourself. Note that
> it breaks if len is not a multiple of 2.
>
> unsigned int csum_partial(const unsigned char * buff, int len, unsigned int sum) {
> __asm__("
> testl %%edx,%%edx

Got to XOR here ( xorl %%edx,%%edx ???)

> 0: lodsw # Get word, source_addr++
> adcl %%eax,%%edx # Sum to accumulator + CY
> loop 0b
> adcl $0,%%edx # Possible last carry
> "
> : "=d"(sum)
> : "0"(sum), "c"(len/2), "S"(buff)
> : "ax", "cx", "si");
> return(sum);
> }
>
> Tom.
>

Thank you for the conversion. Please try to use the cycle-counter to
time things that you are trying to optimize. If you have problems with
the Intel stuff, that require such coding herorics, I think you must
have a bad cache or something else radically wrong with the hardware. I
assure you that the 368, 486, and Pentium should show only very small,
i.e., 3 to 5 percent improvements due to alignment. Note that 486 and
later chips have internal caches as well at the Board Vendor's external
caches. The fact that you measure an order-of-magnitude difference
between your routines and the usual, which I have used for many years,
means that either something is wrong with the measurement or something
is wrong with the hardware.

Now, maybe we are all being stuck with bad hardware. If true, your
code is probably optimum. But, I'll bet that if you cycle-time your
routines, you may find that there were external influences that are
messing up your measurement results.

Cheers,
Dick Johnson
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Project Engineer
Analogic Corporation
Voice : (508) 977-3000 ext. 3754
Fax : (508) 532-6097
Modem : (508) 977-6870
Ftp : ftp@boneserver.analogic.com
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.11 on an i586 machine.
Warning : It's hard to remain at the trailing edge of technology.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-