Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: Ingo Molnar
Date: Fri Nov 01 2013 - 05:13:46 EST

Next message: NeilBrown: "Re: [PATCH] regulator: check for devicetree early."
Previous message: Shawn Guo: "Re: [PATCH v4 0/4] Add dual-fifo mode support of i.MX ssi"
Next in thread: Neil Horman: "Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Neil Horman <nhorman@xxxxxxxxxxxxx> wrote:

> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman <nhorman@xxxxxxxxxxxxx> wrote:
> >
> > > > etc. For such short runtimes make sure the last column displays
> > > > close to 100%, so that the PMU results become trustable.
> > > >
> > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> > > > plus generics like 'cycles', 'instructions' can be added 'for free'
> > > > because they get counted in a separate (fixed purpose) PMU register.
> > > >
> > > > The last column tells you what percentage of the runtime that
> > > > particular event was actually active. 100% (or empty last column)
> > > > means it was active all the time.
> > > >
> > > > Thanks,
> > > >
> > > > Ingo
> > > >
> > >
> > > Hmm,
> > >
> > > I ran this test:
> > >
> > > for i in `seq 0 1 3`
> > > do
> > >     echo $i > /sys/module/csum_test/parameters/module_test_mode
> > >     taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > done
> >
> > You need to remove '-ddd' which is a shortcut for a ton of useful
> > events, but here you want to use fewer events, to increase the
> > precision of the measurement.
> >
> > Thanks,
> >
> > Ingo
> >
>
> Thank you Ingo, that fixed it. I'm trying some other variants of
> the csum algorithm that Doug and I discussed last night, but FWIW,
> the relative performance of the 4 test cases
> (base/prefetch/parallel/both) remains unchanged. I'm starting to
> feel like, at this point, there's very little to gain from doing
> parallel ALU operations (unless we can find a way to break the
> dependency on the carry flag, which is what I'm tinkering with
> now).
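
For reference, one portable way to sidestep the carry flag can be sketched in plain C: add 32-bit words into 64-bit accumulators so that carries collect in the upper half and are folded in only once at the end, and use two accumulators so the two add chains are independent. This is only an illustration under stated assumptions, not the in-kernel csum_partial() and not the variant being tested in this thread. The names fold64(), csum_ref16() and csum_two_acc() are hypothetical, and the buffer is assumed to be small enough (well under 16 GiB) that the 64-bit accumulators cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit running sum down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 16 bits */
	return (uint16_t)sum;
}

/* Hypothetical baseline: one 16-bit word at a time, one accumulator. */
static uint16_t csum_ref16(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint16_t w;

	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&w, p, 2);	/* native-order load, alignment safe */
		sum += w;
	}
	if (len) {			/* odd trailing byte, zero padded */
		w = 0;
		memcpy(&w, p, 1);
		sum += w;
	}
	return fold64(sum);
}

/*
 * Hypothetical carry-free variant: 32-bit loads added into two
 * independent 64-bit accumulators. No add depends on the carry of the
 * previous one, so the two chains can issue in parallel.
 */
static uint16_t csum_two_acc(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t a = 0, b = 0;
	uint32_t w0, w1;

	assert(buf != NULL || len == 0);

	for (; len >= 8; p += 8, len -= 8) {
		memcpy(&w0, p, 4);
		memcpy(&w1, p + 4, 4);
		a += w0;
		b += w1;
	}
	if (len >= 4) {
		memcpy(&w0, p, 4);
		a += w0;
		p += 4;
		len -= 4;
	}
	if (len) {			/* 1..3 trailing bytes, zero padded */
		w0 = 0;
		memcpy(&w0, p, len);
		b += w0;
	}
	return fold64(a + b);
}
```

Both functions return the unfolded-then-folded 16-bit sum in native byte order (not complemented), so csum_two_acc() can be checked against csum_ref16() on arbitrary buffers before any timing is done. Whether the compiler keeps the two chains independent, and whether that helps on a given CPU, would still have to be measured.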

I would still like to encourage you to pick up the improvements that
Doug measured (mostly via prefetch tweaking?) - that looked like
some significant speedups that we don't want to lose!

Also, trying to stick the in-kernel implementation into 'perf bench'
would be a useful first step as well, for this and future efforts.

See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick
up the in-kernel assembly memcpy implementations:

	#define memcpy MEMCPY /* don't hide glibc's memcpy() */
	#define altinstr_replacement text
	#define globl p2align 4; .globl
	#define Lmemcpy_c globl memcpy_c; memcpy_c
	#define Lmemcpy_c_e globl memcpy_c_e; memcpy_c_e

	#include "../../../arch/x86/lib/memcpy_64.S"

So it needed a bit of trickery/wrappery for 'perf bench mem memcpy',
but that is a one-time effort - once it's done then the current
in-kernel csum_partial() implementation would be easily measurable
(and any performance regression in it bisectable, etc.) from that
point on.

In user-space it would also be easier to add various parameters and
experimental implementations and background cache-stressing
workloads automatically.
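
To make the user-space idea concrete, here is a rough sketch of a timing loop that takes the checksum routine as a function pointer, so that experimental implementations can be swapped in and buffer size and iteration count become parameters. It is not the perf bench framework and does not use its API. The names csum_fn, csum_simple() and bench_csum() are hypothetical, csum_simple() merely stands in for whatever implementation is under test, and clock_gettime(CLOCK_MONOTONIC) is assumed to be available (POSIX).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by all implementations under test. */
typedef uint16_t (*csum_fn)(const void *buf, size_t len);

/* Hypothetical stand-in implementation: plain 16-bit word loop. */
static uint16_t csum_simple(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint16_t w;

	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&w, p, 2);
		sum += w;
	}
	if (len) {			/* odd trailing byte, zero padded */
		w = 0;
		memcpy(&w, p, 1);
		sum += w;
	}
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Run fn over buf 'iters' times and return the average wall-clock
 * time per call in nanoseconds. The last checksum is stored in *last
 * (if non-NULL) so that results can be compared across variants.
 */
static double bench_csum(csum_fn fn, const void *buf, size_t len,
			 unsigned int iters, uint16_t *last)
{
	struct timespec t0, t1;
	volatile uint16_t sink = 0;	/* keeps the calls from being elided */
	double ns;

	assert(fn != NULL && iters > 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned int i = 0; i < iters; i++)
		sink = fn(buf, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (last)
		*last = sink;

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	return ns / iters;
}
```

A caller would fill a buffer, call something like bench_csum(csum_simple, buf, 1500, 100000, &sum) once per candidate implementation, and compare both the timings and the returned checksums. Wall-clock time is only a crude proxy here; running the resulting binary under 'perf stat' gives the cycle and cache-miss counts discussed earlier in the thread.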

Something similar might be possible for csum_partial(),
csum_partial_copy*(), etc.

Note, if any of you ventures to add checksum-benchmarking to perf
bench, please base any patches on top of tip:perf/core:

	git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

as there are a couple of perf bench enhancements in the pipeline
already for v3.13.

Thanks,

Ingo
