Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

From: Noah Goldstein
Date: Sun Nov 28 2021 - 16:01:50 EST


On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> ...
> > Regarding the 32 byte case, adding two accumulators helps with the latency
> > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > is the more important metric for the usage of csum_partial()?
> >
> > Here are the numbers for the smaller sizes:
> >
> > size  lat old  lat ver2  lat ver1  tput old  tput ver2  tput ver1
> >    0    4.961     4.503     4.901     4.887      4.399      4.951
> >    8    5.590     5.594     5.620     4.227      4.110      4.252
> >   16    6.182     6.398     6.202     4.233      4.062      4.278
> >   24    7.392     7.591     7.380     4.256      4.246      4.279
> >   32    7.371     6.366     7.390     4.550      4.900      4.537
> >   40    8.621     7.496     8.601     4.862      5.162      4.836
> >   48    9.406     8.128     9.374     5.206      5.736      5.234
> >   56   10.535     9.189    10.522     5.416      5.772      5.447
> >   64   10.000     7.487     7.590     6.946      6.975      6.989
> >   72   11.192     8.639     8.763     7.210      7.311      7.277
> >   80   11.734     9.179     9.409     7.605      7.620      7.548
> >   88   12.933    10.545    10.584     7.878      7.902      7.858
> >   96   12.952     9.331    10.625     8.168      8.470      8.206
> >  104   14.206    10.424    11.839     8.491      8.785      8.502
> >  112   14.763    11.403    12.416     8.798      9.134      8.771
> >  120   15.955    12.635    13.651     9.175      9.494      9.130
> >  128   15.271    10.599    10.724     9.726      9.672      9.655
> >
> > 'ver2' uses two accumulators for the 32 byte case and has better latency
> > numbers but regressions in tput compared to 'old' and 'ver1'. 'ver1' is
> > the implementation posted, which has essentially the same tput/lat
> > numbers as 'old' for sizes [0, 63].
>
> Which cpu are you testing on? It will make a big difference.

Tigerlake, although assuming `adc` is the bottleneck, the results
should be largely independent of the particular microarchitecture.
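
For illustration, here is a minimal sketch of the kind of chain in
question (not the patch itself; the 32-byte size and the names are
just for the example). The single chain serializes on `adc`, while
splitting the sum across two accumulators, as 'ver2' does, halves the
chain length at the cost of a final merge:

#include <stdint.h>

/* one chain: every adc consumes the carry of the previous one, so
 * latency is bounded by the adc dependency chain */
static inline uint64_t csum_32_one_chain(const uint64_t *p, uint64_t sum)
{
    asm("addq 0*8(%[p]), %[s]\n\t"
        "adcq 1*8(%[p]), %[s]\n\t"
        "adcq 2*8(%[p]), %[s]\n\t"
        "adcq 3*8(%[p]), %[s]\n\t"
        "adcq $0, %[s]"
        : [s] "+r" (sum)
        : [p] "r" (p), "m" (*(const char (*)[32])p));
    return sum;
}

/* two chains: independent adc streams the OoO core can overlap,
 * merged with one extra add/adc pair at the end */
static inline uint64_t csum_32_two_chains(const uint64_t *p, uint64_t sum)
{
    uint64_t s2 = 0;

    asm("addq 0*8(%[p]), %[s1]\n\t"
        "adcq 1*8(%[p]), %[s1]\n\t"
        "adcq $0, %[s1]\n\t"
        "addq 2*8(%[p]), %[s2]\n\t"
        "adcq 3*8(%[p]), %[s2]\n\t"
        "adcq $0, %[s2]\n\t"
        "addq %[s2], %[s1]\n\t"
        "adcq $0, %[s1]"
        : [s1] "+r" (sum), [s2] "+r" (s2)
        : [p] "r" (p), "m" (*(const char (*)[32])p));
    return sum;
}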

> And what are you measuring throughput in?

Running back-to-back iterations with the same input and no dependency
between iterations, so the out-of-order (OoO) window covers multiple
iterations at once.
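
Roughly like this (a sketch of the harness shape, not the actual
code; the names here are made up):

#include <stdint.h>

/* stand-in declaration for the routine under test */
extern uint32_t csum_partial(const void *buff, int len, uint32_t sum);

/* latency: each call consumes the previous result, so iterations
 * serialize on the dependency chain */
uint32_t bench_lat(const void *buf, int len, long iters)
{
    uint32_t sum = 0;
    for (long i = 0; i < iters; i++)
        sum = csum_partial(buf, len, sum);
    return sum;
}

/* throughput: no cross-iteration dependency, so the OoO window can
 * keep many calls in flight */
uint32_t bench_tput(const void *buf, int len, long iters)
{
    uint32_t sink = 0;
    for (long i = 0; i < iters; i++)
        sink ^= csum_partial(buf, len, 0);
    return sink;
}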

> And are you testing aligned or mis-aligned 64bit reads?

Aligned, as that is the common case.

>
> I think one of the performance counters will give 'cpu clocks'.

Time is in reference cycles, measured with `rdtsc`.
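
Something along these lines (a sketch; the lfence discipline is my
assumption about a reasonable harness, not taken from the patch):

#include <stdint.h>
#include <x86intrin.h>

/* rdtsc ticks at a fixed reference rate, so deltas are "ref cycles"
 * independent of the current core frequency */
static inline uint64_t tsc_start(void)
{
    _mm_lfence();           /* order earlier work before the read */
    return __rdtsc();
}

static inline uint64_t tsc_stop(void)
{
    _mm_lfence();           /* keep the timed region before the read */
    return __rdtsc();
}

/* usage: t0 = tsc_start(); ...work...; cycles = tsc_stop() - t0; */
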
>
> I did some tests early last year and got 8 bytes/clock on broadwell/haswell
> with code that 'loop carried' the carry flag (a single adc chain).
> On older Intel cpus (Ivy Bridge onwards) 'adc' has a latency of 2
> for the result, but the carry flag is available earlier.
> So alternating the target register in the 'adc' chain will give (nearly)
> 8 bytes/clock - I think I got to 7.5.
>
> That can all be done with only 4 reads per iteration.
> IIRC broadwell/haswell only need 2 reads/iteration.
>
> It is actually likely (certainly worth checking) that haswell/broadwell
> can do two misaligned memory reads every clock.
> So it may not be worth aligning the reads (which the old code did).
> In any case aren't tx packets likely to be aligned, and rx ones
> misaligned to some known 4n+2 boundary?
>
> Using adcx/adox together is a right PITA.

I'm a bit hesitant about adcx/adox because they are part of the ADX
extension, so support would need to be tested at runtime.
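
If an adcx/adox variant were added, the dispatch could hang off the
kernel's existing ADX feature flag, something like this sketch
(csum_partial_adx and csum_partial_generic are hypothetical names):

#include <linux/types.h>
#include <asm/cpufeature.h>

extern __wsum csum_partial_adx(const void *buff, int len, __wsum sum);
extern __wsum csum_partial_generic(const void *buff, int len, __wsum sum);

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
    /* static_cpu_has() is patched to a plain branch once
     * alternatives have run, so the check is nearly free */
    if (static_cpu_has(X86_FEATURE_ADX))
        return csum_partial_adx(buff, len, sum);
    return csum_partial_generic(buff, len, sum);
}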

> I did get (about) 12 bytes/clock for long buffers while loop carrying
> both the overflow and carry flags.
>
> Also is there a copy of the patched code anywhere?

https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@xxxxxxxxxxxxxx/T/


> I think I've missed some of the patches and they are difficult to follow.
>
> David
>