Re: TCP/IP Checksumming

Richard B. Johnson (root@analogic.com)
Tue, 26 Nov 1996 09:51:31 -0500 (EST)


On 25 Nov 1996, Tom May wrote:

>
> "Richard B. Johnson" <root@analogic.com> writes:
>
> >This is intended to be a definitive statement about the TCP/IP
> >checksumming.
>
> I appreciate the work you have done, but you have made a large
> mistake. Read on.
>
> [...]
>
>
> >This will operate under MS-DOS if you remove your memory manager.
> >This code does not enter protected mode so you can readily use
> >MS-DOS facilities to record results and modify code. The 32-bit
> >instructions work in real mode just as they do in protected mode
> >as long as you don't attempt to use memory offsets greater than 64k.
>
> No, they don't. Did you read the comment on the code I sent you,
> which said to run it from a 32-bit segment? I at least do you the
> courtesy of reading what you have to say. Using 32-bit instructions
> in 16-bit DOS requires one or more prefixes to specify they are using
> 32-bit addressing and/or operands. These prefixes require one extra
> cycle each and prevent the instruction from running in the V-pipe
> which means no parallelism is obtained. No parallelism gives a factor
> of two performance hit for this code. The extra cycles for the
> prefixes give even more performance hit. Take away those performance
> hits and you will see the 10:1 ratio.

The performance hit is the same for every procedure tested, and the
purpose of the test-bench is relative testing. The ratio changes by
less than 1 percent, although the machine-cycle counts are lower.

I have run, and can run, this in 32-bit protected mode; however, the
only I/O I have under those conditions is an RS-232C port. The reason
for the very primitive output, i.e., hex-ASCII, is that I didn't want
to have to do a day's work to write 64 bits of decimal on the screen.

The machine cycles are fewer in a 32-bit segment, but they are fewer for
everything. I wanted to allow people to "tune" their stuff and
re-compile, rather than having to re-boot, etc.

Note that using a 32-bit segment does not eliminate the size prefix for
everything. In particular, you can't get a carry out of a 16-bit overflow
into a 32-bit register. Using instructions that always access 32 bits
forces one to accumulate carries and then fold them back at the end.
This is shown to be inefficient even though no size prefix would
be required.
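
For readers who want to see what "accumulate carries and then fold them
back" amounts to, here is a minimal sketch in C rather than assembler. It
is one reading of the approach, not the test-bench code. The function name
fold_chksum and its parameters are made up for illustration, and the sketch
assumes at most 65536 words so the 32-bit accumulator cannot overflow. Like
the assembler routines, it returns the sum without complementing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum 16-bit words into a 32-bit accumulator,
 * then fold the carries that collected in the upper half back into
 * the low 16 bits (end-around carry). */
uint16_t fold_chksum(const uint16_t *words, size_t count)
{
    assert(count <= 65536);         /* assumption: keeps sum below 2^32 */

    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += words[i];            /* carries pile up in bits 16..31 */

    while (sum >> 16)               /* fold until no carry remains */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)sum;
}
```

For example, the two words 0xFFFF and 0x0001 sum to 0x10000, and the fold
brings the carry back around to give 0x0001.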

Looking at my data book for the i486, I see that it MAY be possible
to get back to real mode from protected mode without a processor reset.
If true, I might rewrite the test-bench to do the procedures in protected
mode with 32-bit segments, then return (actually jump) back to real mode
and display the results. I don't need valid interrupt gates because I
want the interrupts off during the tests anyway.

I must defer this experiment for several weeks though because I now
have a non-maskable interrupt called work to do.

>
> For some things, yes, but not for 32-bit code. Any chance you could
> redo this in a 32-bit segment? Also, go ahead and use the code I sent
> you.
>
> Also, try this one (in 16-bit mode) to see how bad lodsw/loop really
> are. This is how SIMPLE_NOLODSW should have been written, and is a fair
> comparison with SIMPLE_CHKSUM since it uses the same size operands
> and addresses:
>
> SPUD_CHKSUM PROC NEAR
> MOV ECX, (PACKET_LEN SHR 1) ; Get packet length
> MOV ESI, OFFSET IP_PACKET ; Point to packet
> XOR EDX, EDX ; Clear CY, zero
> ;
> SPUD: MOV AX, [SI]
> LEA SI,[SI+2]
> ADC DX,AX ; Sum words only with previous CY
> DEC CX
> JNZ SPUD
> ADC DX,0 ; Possible last CY
> MOV EAX,EDX ; Return in EAX to be fair
> RET
> SPUD_CHKSUM ENDP
>
> Tom.

If you muck with the code, you will notice that the big loser is the
LOOP instruction. I didn't think of using LEA in any of the demo code.
However, a possible improvement to your example is:

SPUD:   ADC     DX, WORD PTR [SI]
        LEA     SI, [SI+2]
        DEC     CX              ; Fortunately this doesn't affect CY
        JNZ     SPUD

This is why I wrote the test-bench. I'll bet that if you were to combine
the loop-unrolling and some other things you discover, you will arrive
at code that is much improved over the existing Linux Checksum.
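
To make the flag handling in the SPUD loop above easier to follow, here is
a rough C model of it. This is only a sketch: the name adc_chksum and the
variable cy (standing in for the carry flag) are invented for illustration,
and a C compiler will not necessarily emit ADC for it. One difference from
the assembler: a count of zero simply returns 0 here, whereas DEC CX/JNZ
with CX = 0 would go around 65536 times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the ADC DX,[SI] / LEA / DEC / JNZ loop.
 * `cy` plays the role of CY, which LEA and DEC leave untouched, so
 * the carry from one add is still there for the next ADC. */
uint16_t adc_chksum(const uint16_t *words, size_t count)
{
    assert(words != NULL || count == 0);

    uint16_t dx = 0;                /* XOR EDX,EDX: sum = 0 ...      */
    unsigned cy = 0;                /* ... and CY = 0                */

    while (count--) {
        uint32_t t = (uint32_t)dx + *words++ + cy;  /* ADC DX,[SI]  */
        dx = (uint16_t)t;
        cy = (unsigned)(t >> 16);   /* carry out of bit 15           */
    }

    return (uint16_t)(dx + cy);     /* ADC DX,0: possible last CY    */
}
```

With the words 0xFFFF and 0x0001, the second add wraps DX to zero and sets
the carry, and the final step adds it back in to give 0x0001.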

When you use LODSW and friends, you have to consider whether it was really
necessary to bump the pointer register without changing the flags. Note
that the carry flag will be lost if you advance the pointer register with
ADD; INC, DEC, and LEA leave CY alone.

In the test code, it turns out that addressing of the form
[EBX] ... [EBX+2] ... [EBX+4] seems to run very well,
even though the displacement constant is not free either.
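
The fixed-displacement pattern can be pictured with the following C sketch,
which unrolls the loop four times and bumps the pointer once per block. It
only models the addressing pattern and says nothing about cycle counts. The
name unrolled_chksum, the unroll factor of four, and the deferred fold at
the end are my assumptions for illustration, not the test code itself. It
assumes at most 65536 words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: four adds per pass at fixed displacements,
 * then a single pointer update, then a tail loop for leftovers. */
uint16_t unrolled_chksum(const uint16_t *p, size_t count)
{
    assert(count <= 65536);         /* assumption: keeps sum below 2^32 */

    uint32_t sum = 0;
    size_t blocks = count / 4;
    size_t tail = count % 4;

    while (blocks--) {
        sum += p[0];                /* [EBX]   */
        sum += p[1];                /* [EBX+2] */
        sum += p[2];                /* [EBX+4] */
        sum += p[3];                /* [EBX+6] */
        p += 4;                     /* one pointer bump per four words */
    }
    for (size_t i = 0; i < tail; i++)
        sum += p[i];                /* words left over after the blocks */

    while (sum >> 16)               /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)sum;
}
```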

Cheers,
Dick Johnson
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Project Engineer
Analogic Corporation
Voice : (508) 977-3000 ext. 3754
Fax : (508) 532-6097
Modem : (508) 977-6870
Ftp : ftp@boneserver.analogic.com
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.13 on an i586 machine.
Warning : It's hard to remain at the trailing edge of technology.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-