Re: [PATCH v8 1/5] asm-generic: Improve csum_fold

From: Charlie Jenkins
Date: Fri Oct 27 2023 - 20:05:08 EST


On Sat, Oct 28, 2023 at 12:10:36AM +0100, Al Viro wrote:
> On Fri, Oct 27, 2023 at 03:43:51PM -0700, Charlie Jenkins wrote:
> > /*
> > * computes the checksum of a memory block at buff, length len,
> > * and adds in "sum" (32-bit)
> > @@ -31,9 +33,7 @@ extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
> > static inline __sum16 csum_fold(__wsum csum)
> > {
> > u32 sum = (__force u32)csum;
> > - sum = (sum & 0xffff) + (sum >> 16);
> > - sum = (sum & 0xffff) + (sum >> 16);
> > - return (__force __sum16)~sum;
> > + return (__force __sum16)((~sum - ror32(sum, 16)) >> 16);
> > }
>
> Will (~(sum + ror32(sum, 16))>>16 produce worse code than that?
> Because at least with recent gcc this will generate the exact thing
> you get from arm inline asm...

Yes, that will produce worse code. In (~sum - ror32(sum, 16)) >> 16,
~sum and ror32(sum, 16) do not depend on each other, so an out-of-order
processor can compute them in parallel. ~(sum + ror32(sum, 16)) >> 16
has stricter data dependencies: the add has to wait for the rotate, and
the not has to wait for the add.
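
For reference, a minimal userspace sketch of the three forms discussed
here. It is not kernel code: uint32_t/uint16_t stand in for
__wsum/__sum16, and ror32_sketch(), fold_two_step(), fold_sub(),
fold_add() and fold_mismatches() are hypothetical names used only for
illustration. Since ~(a + b) == ~a - b in 32-bit arithmetic, both
one-line forms should return the same value as the old two-step fold;
they differ only in the order of operations, which is what the
dependency argument is about.

```c
#include <stdint.h>

/* Rotate right by n bits (0 < n < 32); stand-in for the kernel's ror32(). */
static inline uint32_t ror32_sketch(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* The two-step fold that the patch removes. */
static inline uint16_t fold_two_step(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Form from the patch: the NOT and the rotate are independent inputs to the subtract. */
static inline uint16_t fold_sub(uint32_t sum)
{
	return (uint16_t)((~sum - ror32_sketch(sum, 16)) >> 16);
}

/* Suggested form: rotate, then add, then NOT, each waiting on the previous step. */
static inline uint16_t fold_add(uint32_t sum)
{
	return (uint16_t)(~(sum + ror32_sketch(sum, 16)) >> 16);
}

/* Count disagreements with the two-step fold over a strided sample of inputs. */
unsigned long fold_mismatches(void)
{
	unsigned long bad = 0;

	for (uint64_t v = 0; v <= UINT32_MAX; v += 65521) {
		uint32_t s = (uint32_t)v;
		uint16_t ref = fold_two_step(s);

		bad += (fold_sub(s) != ref) + (fold_add(s) != ref);
	}
	/* The stride skips the top of the range, so check all-ones explicitly. */
	bad += (fold_sub(UINT32_MAX) != fold_two_step(UINT32_MAX));
	bad += (fold_add(UINT32_MAX) != fold_two_step(UINT32_MAX));
	return bad;
}
```

Calling fold_mismatches() from a small test program only checks that the
results agree on the sampled inputs. Whether the shorter dependency
chain in fold_sub() actually helps depends on the compiler and the
target, so the generated assembly is still worth comparing.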

- Charlie