Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: Neil Horman
Date: Mon Oct 14 2013 - 16:29:09 EST


On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <nhorman@xxxxxxxxxxxxx> wrote:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware) were spending a significant amount
> > of time computing checksums. We found that by splitting the checksum
> > computation into two separate streams, each skipping successive elements
> > of the buffer being summed, we could parallelize the checksum operation
> > across multiple ALUs. Since neither chain is dependent on the result of
> > the other, we get a speedup in execution (on hardware that has multiple
> > ALUs available, which is almost ubiquitous on x86), and only a
> > negligible decrease on hardware that has only a single ALU (an extra
> > addition is introduced). Since addition is commutative, the result is
> > the same, only faster.
>
> This patch should really come with measurement numbers: what performance
> increase (and drop) did you get on what CPUs.
>
> Thanks,
>
> Ingo
>
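
For anyone following along, the two-stream idea quoted above can be sketched
in portable user-space C roughly as follows. This is only an illustration
under my own assumptions, not the kernel code from the patch: the function
names (fold64, sum_one_stream, sum_two_streams) are hypothetical, and the
input is taken as an array of 16-bit words to sidestep alignment handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: one accumulator, so every add depends on the previous one. */
static uint16_t sum_one_stream(const uint16_t *w, size_t n)
{
	uint64_t s = 0;
	size_t i;

	for (i = 0; i < n; i++)
		s += w[i];
	return fold64(s);
}

/*
 * Two accumulators: 'a' takes the even-indexed words, 'b' the odd ones.
 * Neither chain reads the other's result inside the loop, so a CPU with
 * more than one ALU is free to execute the two adds in parallel.
 */
static uint16_t sum_two_streams(const uint16_t *w, size_t n)
{
	uint64_t a = 0, b = 0;
	size_t i;

	for (i = 0; i + 1 < n; i += 2) {
		a += w[i];
		b += w[i + 1];
	}
	if (i < n)		/* odd word count: pick up the last word */
		a += w[i];

	/* The one extra addition: combine the two chains before folding. */
	return fold64(a + b);
}
```

Both functions return the same value for the same input, since the only
difference is the order in which the words are added.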


So, early testing results today. I wrote a test module that allocated a 4k
buffer, initialized it with random data, and called csum_partial on it 100000
times, recording the time at the start and end of that loop. Results on a 2.4
GHz Intel Xeon processor:

Without patch: Average execute time for csum_partial was 808 ns
With patch: Average execute time for csum_partial was 438 ns
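
The shape of that measurement loop can be sketched in user space roughly as
below. This is not the test module itself; it is a stand-in under stated
assumptions: simple_csum is a hypothetical placeholder for csum_partial,
avg_ns_per_call and csum_fn are names made up for this sketch, and the
standard C clock() (CPU time) replaces whatever timestamps a kernel module
would take.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define BUF_LEN 4096	/* 4k buffer, as in the test described above */

typedef uint16_t (*csum_fn)(const uint16_t *words, size_t nwords);

/* Hypothetical placeholder for the routine under test. */
static uint16_t simple_csum(const uint16_t *w, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += w[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Returns the average CPU time per call in nanoseconds, or -1.0 if the
 * buffer cannot be allocated or iterations is zero.
 */
static double avg_ns_per_call(csum_fn fn, unsigned long iterations)
{
	size_t nwords = BUF_LEN / sizeof(uint16_t);
	uint16_t *buf = malloc(BUF_LEN);
	uint16_t *volatile bufp = buf;	/* reloaded on every iteration */
	volatile uint16_t sink = 0;	/* keeps the results observable */
	clock_t start, end;
	unsigned long n;
	size_t i;

	if (!buf || iterations == 0) {
		free(buf);
		return -1.0;
	}
	for (i = 0; i < nwords; i++)
		buf[i] = (uint16_t)rand();

	start = clock();
	for (n = 0; n < iterations; n++)
		sink ^= fn(bufp, nwords);
	end = clock();

	free(buf);
	return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iterations;
}
```

Calling avg_ns_per_call(simple_csum, 100000) mirrors the iteration count
used above. The volatile pointer and sink are there to discourage the
compiler from hoisting the checksum out of the loop; absolute numbers from a
sketch like this are not comparable to the in-kernel figures.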


I'm currently looking into hpa's suggestion to use alternate instructions
where available. I'll have more soon.
Neil
