Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: Ingo Molnar
Date: Thu Oct 17 2013 - 04:41:34 EST

Next message: Joe Perches: "Re: [PATCH v2 3/9] bitops: Introduce a more generic BITMASK macro"
Previous message: Borislav Petkov: "Re: [PATCH v2 3/9] bitops: Introduce a more generic BITMASK macro"
In reply to: Neil Horman: "Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's"
Next in thread: H. Peter Anvin: "Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's"

* Neil Horman <nhorman@xxxxxxxxxxxxx> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > >
> > > > So, early testing results today. I wrote a test module that allocated a 4k
> > > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop. Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > >
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > >
> > > Impressive, but could you try again with data out of cache ?
> >
> > So I tried your patch on a GRE tunnel and got the following results on a
> > single TCP flow (short result: no visible difference).
> >
> >
>
> So I went to reproduce these results, but was unable to, because I only have
> a pretty jittery network to test across at the moment with these devices. So
> instead I went back to taking measurements with the module that I cobbled
> together, on the assumption that it would give me accurate, relatively
> jitter-free results (I've attached the module code for reference below). My
> results show slightly different behavior:
>
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
>
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
>
>
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
>
>
> For reference, each of the above large numbers is the number of
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.
> To get my average results, I ran the test in a loop 10 times, averaged
> them, and divided by 100000.
>
> Based on these, prefetching is obviously a good improvement, but not
> as good as parallel execution, and the winner by far is doing both.

But in the actual use case mentioned, the packet data was likely cache-cold:
it had just arrived in the NIC and an IRQ got sent. Your test case uses a
super-hot 4K buffer that fits into the L1 cache, so it's apples to
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.
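
As a rough illustration of the access pattern described in the two points above, here is a minimal Python sketch that covers a large buffer in 4 KiB chunks and visits them in shuffled order. The 32 MiB buffer size, the function name and the seed are assumptions made for the example, not values from the patch or from numa.c, and the buffer size would have to be checked against the L2 size of the actual test machine. It only generates the visiting order; the timed csum_partial loop itself would still live in the test module.

```python
import random

def random_walk_offsets(total_bytes, chunk_bytes, seed=None):
    """Return every chunk-aligned offset in the buffer exactly once, shuffled."""
    if chunk_bytes <= 0 or total_bytes < chunk_bytes:
        raise ValueError("buffer must hold at least one chunk")
    offsets = list(range(0, total_bytes - chunk_bytes + 1, chunk_bytes))
    # Shuffling defeats a simple "next chunk" stream pattern that a
    # hardware prefetcher could otherwise pick up.
    random.Random(seed).shuffle(offsets)
    return offsets

# Hypothetical sizes: 32 MiB is assumed to be larger than the L2 cache.
BUF_BYTES = 32 * 1024 * 1024
CHUNK_BYTES = 4096

order = random_walk_offsets(BUF_BYTES, CHUNK_BYTES, seed=1)
print(len(order), "chunks, first offsets visited:", order[:4])
```

Each offset would then be handed to the checksum routine in turn, so that every call starts on data that was most likely evicted since its last use.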

Also, it would be nice to see standard deviation noise numbers when two
averages are close to each other, to be able to tell whether differences
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

         0.189084480 seconds time elapsed                      ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.
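
A similar noise figure can be worked out by hand from the run totals quoted above. The sketch below does this for the two closest result sets, using the plain sample standard deviation relative to the mean; that is an assumption made for illustration and need not match exactly how perf computes its '+-' column. The helper name `summarize` is hypothetical.

```python
import statistics

# Run totals copied from the quoted mail: ns for 100000 csum_partial calls.
PARALLEL_ONLY = [42024233, 44313064, 48304416, 64762297, 42994259,
                 41811628, 55654282, 64892958, 55125582, 42456403]
PREFETCH_AND_PARALLEL = [41329930, 40689195, 61106622, 46332422, 49398117,
                         52525171, 49517101, 61311153, 43691814, 49043084]
CALLS_PER_RUN = 100000

def summarize(run_totals_ns, calls=CALLS_PER_RUN):
    """Return (mean, sample stddev) of the per-call time in ns."""
    per_call = [total / calls for total in run_totals_ns]
    return statistics.mean(per_call), statistics.stdev(per_call)

# Note: the ten parallel-only totals average to roughly 502 ns per call,
# not the 510 ns given in the quoted mail.
for label, runs in (("parallel only", PARALLEL_ONLY),
                    ("prefetch + parallel", PREFETCH_AND_PARALLEL)):
    mean_ns, stddev_ns = summarize(runs)
    rel = 100 * stddev_ns / mean_ns
    print(f"{label:>20}: {mean_ns:7.1f} ns ( +- {rel:5.2f}% )")
```

On these numbers the two means differ by less than one standard deviation of either set, so ten runs of this test cannot separate "parallel only" from "prefetch + parallel".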

Also note that you can inspect many cache behavior details of your
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

       Total time: 0.095 [sec]

   Performance counter stats for 'perf bench sched messaging' (20 runs):

      1519.128721 task-clock (msec)         # 12.305 CPUs utilized         ( +- 0.34% )
           22,882 context-switches          # 0.015 M/sec                  ( +- 2.84% )
            3,927 cpu-migrations            # 0.003 M/sec                  ( +- 2.74% )
           16,616 page-faults               # 0.011 M/sec                  ( +- 0.17% )
    2,327,978,366 cycles                    # 1.532 GHz                    ( +- 1.61% ) [36.43%]
    1,715,561,189 stalled-cycles-frontend   # 73.69% frontend cycles idle  ( +- 1.76% ) [38.05%]
      715,715,454 stalled-cycles-backend    # 30.74% backend cycles idle   ( +- 2.25% ) [39.85%]
    1,253,106,346 instructions              # 0.54 insns per cycle
                                            # 1.37 stalled cycles per insn ( +- 1.71% ) [49.68%]
      241,181,126 branches                  # 158.763 M/sec                ( +- 1.43% ) [47.83%]
        4,232,053 branch-misses             # 1.75% of all branches        ( +- 1.23% ) [48.63%]
      431,907,354 L1-dcache-loads           # 284.313 M/sec                ( +- 1.00% ) [48.37%]
       20,550,528 L1-dcache-load-misses     # 4.76% of all L1-dcache hits  ( +- 0.82% ) [47.61%]
        7,435,847 LLC-loads                 # 4.895 M/sec                  ( +- 0.94% ) [36.11%]
        2,419,201 LLC-load-misses           # 32.53% of all LL-cache hits  ( +- 2.93% ) [ 7.33%]
      448,638,547 L1-icache-loads           # 295.326 M/sec                ( +- 2.43% ) [21.75%]
       22,066,490 L1-icache-load-misses     # 4.92% of all L1-icache hits  ( +- 2.54% ) [30.66%]
      475,557,948 dTLB-loads                # 313.047 M/sec                ( +- 1.96% ) [37.96%]
        6,741,523 dTLB-load-misses          # 1.42% of all dTLB cache hits ( +- 2.38% ) [37.05%]
    1,268,628,660 iTLB-loads                # 835.103 M/sec                ( +- 1.75% ) [36.45%]
           74,192 iTLB-load-misses          # 0.01% of all iTLB cache hits ( +- 2.88% ) [36.19%]
        4,466,526 L1-dcache-prefetches      # 2.940 M/sec                  ( +- 1.61% ) [36.17%]
        2,396,311 L1-dcache-prefetch-misses # 1.577 M/sec                  ( +- 1.55% ) [35.71%]

      0.123459566 seconds time elapsed                                     ( +- 0.58% )

There are also a number of prefetch counters that might be useful:

  aldebaran:~> perf list | grep prefetch
    L1-dcache-prefetches         [Hardware cache event]
    L1-dcache-prefetch-misses    [Hardware cache event]
    LLC-prefetches               [Hardware cache event]
    LLC-prefetch-misses          [Hardware cache event]
    node-prefetches              [Hardware cache event]
    node-prefetch-misses         [Hardware cache event]

Thanks,

Ingo
