RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

From: David Laight
Date: Mon Dec 13 2021 - 17:52:23 EST


From: Alexander Duyck <alexanderduyck@xxxxxx>
> Sent: 13 December 2021 18:40
...
> > Add in the trailing bytes first so that there is no need to worry about the sum
> > exceeding 64 bits.
> >
> > Signed-off-by: David Laight <david.laight@xxxxxxxxxx>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handing buffers that are shorter than 8 bytes, but I'd
> > rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> > arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> > 1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index abf819dd8525..fbcc073fc2b5 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> > __wsum sum)
> > u64 temp64 = (__force u64)sum;
> > unsigned result;
> >
> > + if (len & 7) {
> > + if (unlikely(len < 8)) {
> > + /* Avoid falling off the start of the buffer */
> > + if (len & 4) {
> > + temp64 += *(u32 *)buff;
> > + buff += 4;
> > + }
> > + if (len & 2) {
> > + temp64 += *(u16 *)buff;
> > + buff += 2;
> > + }
> > + if (len & 1)
> > + temp64 += *(u8 *)buff;
> > + goto reduce_to32;
> > + }
> > + temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> > + }
> > +
>
> I don't think your shift is headed in the right direction. If your starting offset is "buff + len - 8"
> then your remaining bits should be in the upper bytes of the qword, not the lower bytes shouldn't
> they? So I would think it should be ">>" not "<<".

Brain-fart :-)
It needs to discard the low bytes - so >> is indeed right.
I did say I hadn't tested it.

Cache line wise I'm not sure whether it matters.
If the data is in the cache it doesn't matter.
If the data isn't in the cache then the only real problem is if the
line gets evicted - only likely for 4k-ish+ buffers.
I'd guess the largest checksum is under 1500 bytes - hardware doing
TSO will be doing hardware checksums. So evections are unlikely.

Plausibly the *(buf + len - 8) read could be done after the while() loop.
That would need an adc and a saved copy of the length (or a read that would trap)
but would only be loading the 'next' cache line.

So you'd end up with something like:
while (len >= 64) {
...
}
if (len & 7)
trail = *(u64 *)(buff + len - 8) >> (8 - (len & 7)) * 8;
if (len & 32)
...
if (len & 16)
...
if (len & 8)
...
temp64 += trail
adc $0, temp64

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)