RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
From: David Laight
Date: Wed Feb 10 2016 - 10:21:07 EST
From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent cpus.
>
> Well, it should execute at 1 instruction/cycle.
I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.
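
For anyone wanting something to compare results against, here is a rough
portable C model of what that loop computes. It is only a sketch: the
function name, the explicit 'carry' variable standing in for CF, and the
final carry fold (the "adcq $0,%rax" that would follow the loop) are my
assumptions, not part of the code quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of "adcq 0(%rdi,%rcx,8),%rax; inc %rcx; jnz 10b".
 * 'end' plays the role of %rdi (one past the last word) and 'i' the
 * role of %rcx (a negative word index that counts up to zero).
 */
static uint64_t csum_words_model(const uint64_t *buf, size_t nwords)
{
	const uint64_t *end = buf + nwords;
	ptrdiff_t i = -(ptrdiff_t)nwords;
	uint64_t sum = 0;
	unsigned int carry = 0;	/* stands in for CF, which inc leaves alone */

	assert(nwords > 0);	/* the asm loop always runs at least once */
	do {
		uint64_t t = sum + end[i];
		unsigned int c = t < sum;	/* carry out of sum + word */

		sum = t + carry;
		carry = c | (sum < t);		/* carry out of adding CF */
	} while (++i != 0);

	/* Assumed epilogue: fold the last carry back in (end-around). */
	return sum + carry;
}
```

Summing { 0xffffffffffffffff, 2 } with this wraps once and folds the
carry back in, giving 2.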
> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.
> To break that requires ADCX/ADOX:
>
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
> 	leaq	16(%rcx),%rcx
> 	jrcxz	11f
> 	jmp	10b
> 11:
Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks, so the loop definitely needs a second adoxq/adcxq pair.
Experiments would be needed to confirm these guesses, though.
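
To check the output of any such experiment, a C model of the two-chain
idea may be useful. Again this is only a sketch under assumptions: the
names are made up, a plain up-counting index replaces the
negative-offset/jrcxz loop control, and the epilogue that folds both
carries and merges the two sums is my guess at what would follow the
loop, since the quoted code stops at label 11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One add-with-carry step: *sum += word + *carry, updating *carry. */
static void adc_step(uint64_t *sum, unsigned int *carry, uint64_t word)
{
	uint64_t t = *sum + word;
	unsigned int c = t < *sum;

	*sum = t + *carry;
	*carry = c | (*sum < t);
}

/* Two accumulators with independent carries, as with adcx (CF) / adox (OF). */
static uint64_t csum_two_chains_model(const uint64_t *buf, size_t nwords)
{
	uint64_t sum_cf = 0, sum_of = 0;	/* %rax and %rdx */
	unsigned int cf = 0, of = 0;		/* CF and OF */
	size_t i;

	assert(nwords > 0 && nwords % 2 == 0);	/* two words per iteration */
	for (i = 0; i < nwords; i += 2) {
		adc_step(&sum_cf, &cf, buf[i]);		/* adcxq 0(...),%rax */
		adc_step(&sum_of, &of, buf[i + 1]);	/* adoxq 8(...),%rdx */
	}

	/* Assumed epilogue: fold each carry, then merge the two sums. */
	sum_cf += cf;
	sum_of += of;
	cf = 0;
	adc_step(&sum_cf, &cf, sum_of);
	return sum_cf + cf;
}
```

Because the two chains never touch each other's carry, the result should
match a single-chain sum over the same words; that is the property the
real adcx/adox loop relies on.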
David