Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
From: Noah Goldstein
Date: Fri Nov 26 2021 - 13:42:58 EST
On Fri, Nov 26, 2021 at 10:09 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <goldstein.w.n@xxxxxxxxx> wrote:
> >
> > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@xxxxxxxxx> wrote:
> > > >
> > > > Modify the 8x loop so that it uses two independent
> > > > accumulators. Despite adding more instructions, the latency and
> > > > throughput of the loop are improved because the `adc` chains can now
> > > > take advantage of multiple execution units.
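[ For reference, "two independent accumulators" here means roughly the even
  4/4 split below. This is only an illustrative sketch, not the exact hunk
  from the patch; temp64/temp_accum/buff are the same variables used in the
  asm examples further down in this thread:

    /* First chain: quads 0-3 into a scratch accumulator, fold the carry. */
    /* Second chain: quads 4-7 into the running sum, then add the scratch. */
    asm("movq 0*8(%[src]),%[res_tmp]\n\t"
        "addq 1*8(%[src]),%[res_tmp]\n\t"
        "adcq 2*8(%[src]),%[res_tmp]\n\t"
        "adcq 3*8(%[src]),%[res_tmp]\n\t"
        "adcq $0,%[res_tmp]\n\t"
        "addq 4*8(%[src]),%[res]\n\t"
        "adcq 5*8(%[src]),%[res]\n\t"
        "adcq 6*8(%[src]),%[res]\n\t"
        "adcq 7*8(%[src]),%[res]\n\t"
        "adcq %[res_tmp],%[res]\n\t"
        "adcq $0,%[res]"
        : [res] "+r" (temp64), [res_tmp] "=&r" (temp_accum)
        : [src] "r" (buff)
        : "memory"); ]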
> > >
> > > Nice !
> > >
> > > Note that I get better results if I do a different split, because the
> > > second chain gets shorter.
> > >
> > > First chain adds 5*8 bytes from the buffer, but the first 8 bytes are a
> > > mere load, so that is really 4+1 additions.
> > >
> > > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > > from the first chain, also 4+1 additions.
> >
> > Good call. With that approach I also see an improvement for the 32 byte
> > case (which is on the hot path) so this change might actually matter :)
> >
> > >
> > > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > > "adcq $0,%[res_tmp]\n\t"
> > > "addq 5*8(%[src]),%[res]\n\t"
> > > "adcq 6*8(%[src]),%[res]\n\t"
> > > "adcq 7*8(%[src]),%[res]\n\t"
> > > "adcq %[res_tmp],%[res]\n\t"
> > > "adcq $0,%[res]"
> > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > > : [src] "r" (buff)
> > > : "memory");
> > >
> > >
> > > >
> > > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > > the exact usage range. There is no reason to write-clobber all memory.
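[ Concretely, "more precise" means dropping the blanket "memory" clobber and
  instead passing the exact bytes that are read as an input operand, e.g. for
  the 64-byte block (sketch only, same variables as in the asm example above):

    asm("addq 0*8(%[src]),%[res]\n\t"
        "adcq 1*8(%[src]),%[res]\n\t"
        "adcq 2*8(%[src]),%[res]\n\t"
        "adcq 3*8(%[src]),%[res]\n\t"
        "adcq 4*8(%[src]),%[res]\n\t"
        "adcq 5*8(%[src]),%[res]\n\t"
        "adcq 6*8(%[src]),%[res]\n\t"
        "adcq 7*8(%[src]),%[res]\n\t"
        "adcq $0,%[res]"
        : [res] "+r" (temp64)
        : [src] "r" (buff),
          /* the asm only reads these 64 bytes and writes no memory */
          "m" (*(const char(*)[64])buff)); ]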
> > >
> > > Not sure if that matters in this function? Or do we expect it to be inlined?
> >
> > It may matter for an LTO build. I also think it can matter for the loop
> > case. I didn't see any difference when playing around with the function
> > in userland with:
> >
> > ```
> > gcc -O3 -march=native -mtune=native checksum.c -o checksum
> > ```
> >
> > but IIRC if the clobber is "memory", loops with inline assembly payloads
> > can be de-optimized because GCC can't prove the iterations don't affect
> > each other.
> >
> >
> > >
> > > Personally, I find the "memory" constraint to be more readable than these casts
> > > "m"(*(const char(*)[64])buff));
> > >
> >
> > Hmm, I personally find it more readable if I can tell what memory
> > transforms happen
> > just from reading the clobbers, but you're the maintainer.
>
> I am not x86 maintainer at all, I was only giving an opinion.
>
> >
> > Do you want it changed in V2?
>
> Let's hear from x86 maintainers :)
>
> >
> > > >
> > > > Relative performance changes on Tigerlake:
> > > >
> > > > Time Unit: Ref Cycles
> > > > Size Unit: Bytes
> > > >
> > > > size, lat old, lat new, tput old, tput new
> > > > 0, 4.972, 5.054, 4.864, 4.870
> > >
> > > Really what matters in modern networking is the case for 40 bytes, and
> > > eventually 8 bytes.
> > >
> > > Can you add these two cases in this nice table ?
> > >
> >
> > Sure, with your suggestion there is an improvement in the 32-byte case
> > too.
> >
> > > We hardly have to checksum anything with NICs that are not decades old.
> > >
> > > Apparently making the 64-byte loop slightly longer incentivizes gcc to
> > > move it away (our intent with the unlikely() hint).
> > >
> > > Anyway I am thinking of providing a specialized inline version for
> > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > > time),
> > > so we will likely not use csum_partial() anymore.
> >
> > I see. For now is it worth adding a case for 40 in this implementation?
> >
> > if (likely(len == 40)) {
> >         /* optimized 40-byte + buff-aligned case */
> > } else {
> >         /* existing code */
> > }
>
> This is what I had first in my tree to demonstrate what gains could be achieved.
>
> But decided to keep csum_partial() as a generic function, yet optimize it a bit.
>
> My plan is to have a new inline helper that could be a mere
> csum_partial() on arches that do not have an optimized version.
>
> Preliminary patch to show the idea (this only compiles on x86; we have
> to add the generic definition). The first call point is in the GRO stack.
> Note some of us in the US are not working until next Monday.
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -182,4 +182,25 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
> (__force unsigned)addend);
> }
>
> +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> +{
> +	u64 temp64;
> +
> +	if (unlikely(len != 40))
> +		return csum_partial(buff, len, sum);
> +
> +	temp64 = (__force u64)sum;
> +	asm("addq 0*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq $0,%[res]"
> +	    : [res] "+r" (temp64)
> +	    : [src] "r" (buff)
> +	    : "memory");
> +	return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
Makes sense. Although if you inline it, I think you will definitely want a
more precise clobber than just "memory". I also think that with 40 bytes you
will get some value from two accumulators.
Did you see the numbers/question I posted about two accumulators for the
32-byte case?
It's a judgement call about latency vs. throughput that I don't really have an
answer for.
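To be concrete about the 40-byte + two accumulators point, something like a
3/2 split is what I have in mind (untested sketch; temp_accum is just a
scratch register, and temp64 starts out holding sum as in your helper above):

    /* First chain: quads 0-2 into the scratch accumulator, fold the carry. */
    /* Second chain: quads 3-4 into the running sum, then add the scratch. */
    asm("movq 0*8(%[src]),%[res_tmp]\n\t"
        "addq 1*8(%[src]),%[res_tmp]\n\t"
        "adcq 2*8(%[src]),%[res_tmp]\n\t"
        "adcq $0,%[res_tmp]\n\t"
        "addq 3*8(%[src]),%[res]\n\t"
        "adcq 4*8(%[src]),%[res]\n\t"
        "adcq %[res_tmp],%[res]\n\t"
        "adcq $0,%[res]"
        : [res] "+r" (temp64), [res_tmp] "=&r" (temp_accum)
        : [src] "r" (buff)
        : "memory");
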
> +}
> +#define ipv6_csum_partial ipv6_csum_partial
> +
> #endif /* _ASM_X86_CHECKSUM_64_H */
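[ For the generic definition you mention: I assume a fallback along these
  lines in the generic checksum header is what you have in mind (sketch only;
  the exact header file it should live in is a guess on my part):

    /* Arches without a specialized helper just use plain csum_partial(). */
    #ifndef ipv6_csum_partial
    static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
    {
            return csum_partial(buff, len, sum);
    }
    #endif
]
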
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
> 	NAPI_GRO_CB(skb)->is_atomic = true;
> 	NAPI_GRO_CB(skb)->flush |= flush;
>
> -	skb_gro_postpull_rcsum(skb, iph, nlen);
> +	if (NAPI_GRO_CB(skb)->csum_valid)
> +		NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen,
> +					    ~NAPI_GRO_CB(skb)->csum);
>
> 	pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
> 					  ops->callbacks.gro_receive, head, skb);