Re: ia32 ip checksum optimizations

Artur Skawina (skawina@geocities.com)
Wed, 26 May 1999 05:13:24 +0200


Andrea Arcangeli wrote:
>
> On Tue, 25 May 1999, Artur Skawina wrote:
>
> >it's faster for the aligned buffer, multiple-of-four len case, it's
> >slower when len&3, and it's a lot slower for the unaligned buffer case.
>
> [..]
> csum_partial:
> pushl %esi
> pushl %ebx
> movl 20(%esp),%eax # Function arg: unsigned int sum
> movl 16(%esp),%ecx # Function arg: int len
> movl 12(%esp),%esi # Function arg: const unsigned char *buf
>
> testl $2, %esi
> ^^^^^^^^^^^^^^
> jnz 30f
> [..]
>
> The underlined line won't check for 4byte alignment of %esi. I think you
> are missing this bit of the current 686 csum_partial. Otherwise I don't
> follow you.

It does test for 32bit aligment - this routine can not be called
with buf&1, so it's enough to just check this bit.
Note that this check is required to optimize away the branches
at the end (the part you removed).

> You now should note that %esi maybe equal to 1 or 0x11 or 0x21 and you'll
> go ahead in the unrolled loop. That's not exactly the bug but now I think

no, this code can simply not be called with such a buffer.

> to see why you don't see the `andl' buggy. Yes, if %esi would be 4byte
> aligned then the andl wouldn't trigger an Oops or a segfault in userspace,
> even if was reading in overflow after the buffer boundary for some byte,
> because pages of memory are always 4byte aligned. But with the current

correct.

> To check if %esi is 4byte aligned you should do:
>
> testl $3, %esi
>
> instead, and then rewrite the slow path to 4byte align %esi (instead of
> computing the checksum for two bytes and go ahead as happens now).

the current code _is_ enough.

> >> And btw, I think %esi is going to be aligned.
> >
> >and you have verified this?
>
> The skb->data is going to be aligned because skb->data originally is a
> kmalloced area (kmalloc return regions aligned with the L1 cache that it's
> 32 or 16 byte in all x86), then skb->data it's advanced by skb_pull of
> 4byte aligned regions. I am not aware of an hardware protocol that uses an
> header not 32bit aligned though.
>
> Even if it could happen to have in an alien-case the checksum pool not
> 32bit aligned, that wouldn't be sure the standard case, and it would work
> _fine_ anyway, just with the cksum runing 44% slower.

Hmm, do you realize that this directly contradicts your other theory? ;)
[you can either have %esi 32bit aligned or not, but hardly both...]

But seriously, all data I have gathered so far indicates that the
buffer is always 32bit aligned. But until this is confirmed
to be guaranteed, it's too early to remove that check; esp. since
the gain is so microscopic.

> Did you ever seen my patch? The numbers are included in my email with the
> patch attacched.

Arrived while I was typing that msg.

> andl -128(%esi),%ebx # esi is 4-aligned so should be ok
> ^^^ esi won't be 4byte aligned for the
> point above

it will, see above

> To make you happy now I also reproduced in the kernel. Look this
> mini-kernel module:

> csum_partial1(buff+1, PAGE_SIZE-1, 0);

> Unable to handle kernel paging request at virtual address c281b000
> ^^^^^^^^ next page

Then don't do this :)

The routine is not supposed to be called like this. Throwing random, illegal
args at internal functions doesn't make much sense, does it?

> ------------ change topic to "having %esi 4byte aligned or not" ----------

I really am surprised that one can argue these two things in one msg,
interleaved :)

> If you would find a case where skb->data is not 4byte aligned then you
> would convince me to _write_ the code to make %esi 32bit aligned before
> entering the unrolled loop.

one of the reasons for that patch (in the iackk tarball) was to identify
such cases, but it hasn't found one. Still that doesn;t necessarily mean
they don't exist.

> And if something, we could consider to change csum_partial_copy_generic to
> try to align at least one between %esi and %edi if they are different and
> both not 32bit aligned.

the <686 csum_partial_copy_generic already does align the destination.

> Note also that %esi in the csum-copy case, is the pointer given from
> userspace. And userspace can also provide a not 32bit aligned region. But
> %edi is the skb that will just be 32bit aligned. And so you are just in
> the best condition you can have, since you can't 32bit align both %esi and
> %edi if they are different.
>
> Finally right now I am still very happy with my patch I posted in the list
> (I had not the need to change it (yet :)) and it still think it should be
> applyed to both 2.3.3 and 2.2.9 (for the record: I don't know if it will
> apply cleanly to 2.2.9 but once applyed to 2.3.3 the checksum.S of 2.3.3
> can be copyed with cp to the 2.2.9 tree to automagically fix also the %esi
> and %ebx clobbering that got fixed somewhere between 2.2.9 and 2.3.3).

2.3.3 (the 686 routines still are not used in 2.2, but there haven't been
a 2.2 release in the mean time; I assume Linus applied the patch to both
trees)

> Comments?

Now, with the alignment issue behind, how about concentrating
on improving the routines a bit more?

artur

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/