Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perfbench mem memcpy

From: Hitoshi Mitake
Date: Tue Jan 11 2011 - 11:27:56 EST


On 2010/11/01 18:02, Ingo Molnar wrote:

* Hitoshi Mitake<mitake@xxxxxxxxxxxxxxxxxxxxx> wrote:

On 2010/10/31 04:23, Ingo Molnar wrote:

* Hitoshi Mitake<mitake@xxxxxxxxxxxxxxxxxxxxx> wrote:

This patch adds a new file, mem-memcpy-x86-64-asm.S,
for x86-64 specific memcpy() benchmarking.
The newly added benchmarks are:
x86-64-rep: memcpy() implemented with the rep instruction
x86-64-unrolled: unrolled memcpy()
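
For illustration only, the rep-based variant boils down to something
like the following C sketch with inline assembly (the benchmark itself
uses the kernel's assembly in mem-memcpy-x86-64-asm.S, not this code):

#include <stddef.h>

/* Copy len bytes with "rep movsq" plus a "rep movsb" tail.
 * Illustrative sketch only; not the code added by this patch. */
static void *memcpy_rep_sketch(void *to, const void *from, size_t len)
{
	void *ret = to;
	size_t qwords = len >> 3;
	size_t tail = len & 7;

	asm volatile("rep movsq"
		     : "+D" (to), "+S" (from), "+c" (qwords)
		     : : "memory");
	asm volatile("rep movsb"
		     : "+D" (to), "+S" (from), "+c" (tail)
		     : : "memory");
	return ret;
}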

The original idea of including the kernel's source files for
benchmarking was suggested by Ingo Molnar. This is more effective than
one-off test programs for quantitative evaluation of small, frequently
called leaf functions in the kernel, because perf bench lives in the
kernel source tree and is easy to run on a wide range of hardware,
especially new CPU models.

This approach can also be used for other kernel functions, e.g. checksum functions.

Example of usage on Core i3 M330:

| % ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
|
| 578.732506 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-rep
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
|
| 738.184980 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
|
| 767.483269 MB/Sec

This clearly shows that the unrolled memcpy() is more efficient
than the rep version and glibc's one :)

Hey, really cool output :-)

Might also make sense to measure Ma Ling's patched version?

By Ma Ling's patched version, do you mean memcpy() with the patch at

http://marc.info/?l=linux-kernel&m=128652296500989&w=2

applied?
(It seems that this patch was actually written by Miao Xie.)

I'll include the results of the patched version in the next post.

(Indeed it is Miao Xie - sorry!)

# checkpatch.pl warns about the two externs added to bench/mem-memcpy.c
# by this patch, but I think that is not a problem.

You should put these:

+#ifdef ARCH_X86_64
+extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
+extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
+#endif

into a .h file - a new one if needed.

That will make both checkpatch and me happier ;-)


OK, I'll split these declarations out into a separate header file.
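
For example, the header could look roughly like this (the file name and
include guard are my guesses, not necessarily what I will post):

/* bench/mem-memcpy-arch.h -- name is a guess */
#ifndef PERF_BENCH_MEM_MEMCPY_ARCH_H
#define PERF_BENCH_MEM_MEMCPY_ARCH_H

#include <stddef.h>

#ifdef ARCH_X86_64
extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
#endif

#endif /* PERF_BENCH_MEM_MEMCPY_ARCH_H */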

BTW, I found a really interesting evaluation result.
The current results of "perf bench mem memcpy" include
page fault overhead, because the measured memcpy()
is the first access to the allocated memory area.

I tested another version of perf bench mem memcpy, which
performs a memcpy() before the measured memcpy() to remove
the overhead coming from page faults.
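
Roughly, the measurement changes like this (a sketch with assumed
names, not the actual patch):

#include <string.h>
#include <sys/time.h>

typedef void *(*memcpy_t)(void *dst, const void *src, size_t len);

/* Sketch: touch every page with a warm-up copy so that the timed run
 * no longer includes demand-paging overhead.  Names are assumed. */
static double copy_seconds_prefaulted(memcpy_t fn, void *dst,
				      const void *src, size_t len)
{
	struct timeval start, end;

	memcpy(dst, src, len);		/* warm-up copy, not timed */

	gettimeofday(&start, NULL);
	fn(dst, src, len);		/* measured copy */
	gettimeofday(&end, NULL);

	return (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6;
}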

And this is the result:

% ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...

4.608340 GB/Sec

% ./perf bench mem memcpy -l 500MB
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...

4.856442 GB/Sec

% ./perf bench mem memcpy -l 500MB -r x86-64-rep
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...

6.024445 GB/Sec

The relation between the scores is reversed!
I cannot explain the cause of this result yet, and
it is a really interesting phenomenon.

Interesting indeed, and it would be nice to analyse that! (It should be possible,
using various PMU metrics in a clever way, to figure out what's happening inside the
CPU, right?)


I collected the PMU information for each memcpy case;
the results are below:

(I used the partial monitoring patch I posted before, https://patchwork.kernel.org/patch/408801/,
and a local modification for testing the rep-based memcpy.)
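
For reference, this kind of counting can also be done from inside the
benchmark itself with perf_event_open(2); a rough sketch (not the
monitoring patch above):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

/* Open one hardware counter for the calling thread, e.g.
 * open_counter(PERF_COUNT_HW_BRANCH_MISSES). */
static int open_counter(uint64_t config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = config;
	attr.disabled = 1;

	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Usage around the measured copy (error handling omitted):
 *
 *	int fd = open_counter(PERF_COUNT_HW_BRANCH_MISSES);
 *	uint64_t count;
 *
 *	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
 *	fn(dst, src, len);
 *	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
 *	read(fd, &count, sizeof(count));
 */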

benchmarking without prefaulting

unrolled

Score: 685.812729 MB/Sec
Stat:
Performance counter stats for process id '4139':

725.939831 task-clock-msecs # 0.995 CPUs
74 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
256,002 page-faults # 0.353 M/sec
1,535,468,702 cycles # 2115.146 M/sec
1,691,516,817 instructions # 1.102 IPC
291,260,006 branches # 401.218 M/sec
1,487,762 branch-misses # 0.511 %
8,470,560 cache-references # 11.668 M/sec
8,364,176 cache-misses # 11.522 M/sec

0.729488573 seconds time elapsed

rep based

Score: 670.172114 MB/Sec
Stat:
Performance counter stats for process id '5539':

742.943772 task-clock-msecs # 0.995 CPUs
77 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
256,002 page-faults # 0.345 M/sec
1,578,787,149 cycles # 2125.043 M/sec
1,499,144,628 instructions # 0.950 IPC
275,684,806 branches # 371.071 M/sec
1,522,326 branch-misses # 0.552 %
8,503,747 cache-references # 11.446 M/sec
8,386,673 cache-misses # 11.288 M/sec

0.746320411 seconds time elapsed

benchmarking with prefaulting

unrolled

Score: 4.485941 GB/Sec
Stat:
Performance counter stats for process id '4279':

108.466761 task-clock-msecs # 0.994 CPUs
11 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
2 page-faults # 0.000 M/sec
218,260,432 cycles # 2012.233 M/sec
199,520,023 instructions # 0.914 IPC
16,963,327 branches # 156.392 M/sec
8,169 branch-misses # 0.048 %
2,955,221 cache-references # 27.245 M/sec
2,916,018 cache-misses # 26.884 M/sec

0.109115820 seconds time elapsed

rep based

Score: 5.972859 GB/Sec
Stat:
Performance counter stats for process id '5535':

81.609445 task-clock-msecs # 0.995 CPUs
8 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
2 page-faults # 0.000 M/sec
173,888,853 cycles # 2130.744 M/sec
3,034,096 instructions # 0.017 IPC
607,897 branches # 7.449 M/sec
5,874 branch-misses # 0.966 %
8,276,533 cache-references # 101.416 M/sec
8,274,865 cache-misses # 101.396 M/sec

0.082030877 seconds time elapsed

Again, the surprising point is the reversal of the score relation.
I cannot find the direct reason for this reversal,
but it seems that the branch-miss counts reflect it.

I have to look into this more deeply...