Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
From: Noah Goldstein
Date: Fri Nov 26 2021 - 14:31:50 EST
On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@xxxxxxxxx> wrote:
> >
> > Makes sense. Although if you inline I think you definitely will want a more
> > conservative clobber than just "memory". Also I think with 40 you also will
> > get some value from two counters.
> >
> > Did you see the number/question I posted about two accumulators for 32
> > byte case?
> > It's a judgement call about latency vs. throughput that I don't really have an
> > answer for.
> >
>
> The thing I do not know is whether using more execution units would slow
> down the other hyperthread?
There are more uops in the two-accumulator version, so it could be a concern
iff the other hyperthread is bottlenecked on p06 throughput. My general
understanding is that this is not the common case, and that the very premise of
hyperthreads is that most bottlenecks come from memory fetches or from resolving
control flow.
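
To make the tradeoff concrete, here is a rough portable C sketch of the two
shapes for the 32-byte case. This is not the inline asm from the patch: the
names add_carry, csum32_one_acc, csum32_two_acc and csum32_selfcheck are made
up for illustration, and a compiler will not necessarily turn this into the
same adc chain. The point is only the dependency structure: one serial chain
versus two shorter chains plus an extra merge step (hence more uops).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): 64-bit add with end-around carry. */
static uint64_t add_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	/* If the add wrapped, fold the carry back in. This cannot wrap again. */
	return s + (s < a);
}

/* One accumulator: four adds, each dependent on the previous one. */
static uint64_t csum32_one_acc(const uint64_t w[4], uint64_t sum)
{
	sum = add_carry(sum, w[0]);
	sum = add_carry(sum, w[1]);
	sum = add_carry(sum, w[2]);
	sum = add_carry(sum, w[3]);
	return sum;
}

/*
 * Two accumulators: two independent chains (shorter critical path),
 * at the cost of one extra add to merge them at the end.
 */
static uint64_t csum32_two_acc(const uint64_t w[4], uint64_t sum)
{
	uint64_t a = add_carry(sum, w[0]);
	uint64_t b = w[1];

	a = add_carry(a, w[2]);
	b = add_carry(b, w[3]);
	return add_carry(a, b);
}

/* Sanity check: both orderings give the same folded sum. */
static void csum32_selfcheck(void)
{
	static const uint64_t in[3][4] = {
		{ 1, 2, 3, 4 },
		{ UINT64_MAX, 1, 2, 3 },
		{ UINT64_MAX, UINT64_MAX, UINT64_MAX, UINT64_MAX },
	};

	for (int i = 0; i < 3; i++) {
		assert(csum32_one_acc(in[i], 0) == csum32_two_acc(in[i], 0));
		assert(csum32_one_acc(in[i], 7) == csum32_two_acc(in[i], 7));
	}
}
```

Both variants return the same value because addition with end-around carry is
associative and commutative. What differs is how many dependent adds sit on
the critical path and how many total operations get issued.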
>
> Would using ADCX/ADOX be better in this respect?
What would code using those instructions look like? I'm having trouble
seeing how to use them here.