random: Benchmarking fast_mix2

From: George Spelvin
Date: Thu Jun 12 2014 - 00:13:32 EST

In reply to: Theodore Ts'o: "Re: drivers/char/random.c: More futzing about"
Next in thread: George Spelvin: "Re: random: Benchamrking fast_mix2"

> I redid my numbers, and I can no longer reproduce the 7x slowdown. I
> do see that if you compile w/o -O2, fast_mix2 is twice as slow. But
> it's not 7x slower.

For my single-round version, I needed to drop to 2 loops rather than 3
to match the speed. That's in the source I posted, but I didn't point it out.

(It wasn't an attempt to be deceptive; that's just how I happened
to have left the file when I was experimenting with various options.
I figured that if we were looking for 7x, 1.5x wasn't all that important.)

That explains some of the residual difference between our figures.

When developing, I was using a many-iteration benchmark, and I suspect it
fitted in the Ivy Bridge uop cache, which let it saturate the execution
resources.
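The difference between those two kinds of benchmark can be sketched in a few lines of C. One helper brackets a single call with counter reads, so it pays any cold-cache and decode cost; the other times many back-to-back calls and divides, which is the case where a small loop body can run from the uop cache. This is an illustration, not the harness used for the numbers here: `toy_mix()`, `time_single_call()` and `time_per_call_in_loop()` are hypothetical, and the non-x86 fallback to `clock_gettime()` (nanoseconds rather than cycles) is an assumption.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* x86: read the time-stamp counter (cycles). */
static inline uint64_t read_counter(void)
{
	return __rdtsc();
}
#else
#include <time.h>
/* Other architectures (assumption): monotonic nanoseconds instead. */
static inline uint64_t read_counter(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical stand-in for the mixing function under test. */
static void toy_mix(uint32_t pool[4], uint32_t input)
{
	pool[0] += input;
	pool[1] ^= pool[0];
	pool[2] += (pool[1] << 7) | (pool[1] >> 25);	/* rotate left by 7 */
	pool[3] ^= pool[2];
}

/* Style 1: one call bracketed by counter reads (includes cold-start cost). */
static uint64_t time_single_call(uint32_t pool[4], uint32_t input)
{
	uint64_t t0 = read_counter();

	toy_mix(pool, input);
	return read_counter() - t0;
}

/*
 * Style 2: many back-to-back calls, averaged.  A small loop body like
 * this can be served from the decoded-uop cache, so the per-call figure
 * can come out lower than what a lone call costs.
 */
static uint64_t time_per_call_in_loop(uint32_t pool[4], unsigned iters)
{
	uint64_t t0;

	assert(iters > 0);
	t0 = read_counter();
	for (unsigned i = 0; i < iters; i++)
		toy_mix(pool, i);
	return (read_counter() - t0) / iters;
}
```

A careful harness would also serialize around the counter reads (for example with CPUID or LFENCE) and repeat each measurement; that is left out to keep the sketch short.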

Sorry for the premature alarm; I'll go back to work and find something
better.

I still get comparable speed for 2 loops and -O2:
$ cc -W -Wall -m32 -O2 -march=native random.c -o random32
# ./perftest ../spooky/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 148 124 (-24)
1: 48 36 (-12)
2: 40 36 (-4)
3: 44 40 (-4)
4: 44 40 (-4)
5: 36 36 (+0)
6: 52 36 (-16)
7: 44 32 (-12)
8: 44 36 (-8)
9: 48 36 (-12)
$ cc -W -Wall -m64 -O2 -march=native random.c -o random64
# ./perftest ../spooky/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 132 104 (-28)
1: 40 40 (+0)
2: 36 44 (+8)
3: 32 40 (+8)
4: 40 36 (-4)
5: 32 40 (+8)
6: 36 44 (+8)
7: 40 40 (+0)
8: 36 44 (+8)
9: 40 36 (-4)
$ cc -W -Wall -m32 -O3 -march=native random.c -o random32
# ./perftest ./random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 88 48 (-40)
1: 36 40 (+4)
2: 36 44 (+8)
3: 32 40 (+8)
4: 36 40 (+4)
5: 96 40 (-56)
6: 40 40 (+0)
7: 36 40 (+4)
8: 28 48 (+20)
9: 28 40 (+12)
$ cc -W -Wall -m64 -O3 -march=native random.c -o random64
# ./perftest ./random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 72 80 (+8)
1: 36 52 (+16)
2: 32 36 (+4)
3: 32 36 (+4)
4: 28 40 (+12)
5: 32 40 (+8)
6: 32 40 (+8)
7: 32 36 (+4)
8: 28 44 (+16)
9: 36 36 (+0)
$ cc -W -Wall -m32 -Os -march=native random.c -o random32
# ./perftest ./random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 108 132 (+24)
1: 44 44 (+0)
2: 76 40 (-36)
3: 44 48 (+4)
4: 36 40 (+4)
5: 32 44 (+12)
6: 40 56 (+16)
7: 44 36 (-8)
8: 44 40 (-4)
9: 32 40 (+8)
$ cc -W -Wall -m64 -Os -march=native random.c -o random64
# ./perftest ./random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 96 108 (+12)
1: 44 52 (+8)
2: 40 40 (+0)
3: 40 36 (-4)
4: 40 32 (-8)
5: 36 36 (+0)
6: 44 32 (-12)
7: 36 36 (+0)
8: 40 36 (-4)
9: 40 36 (-4)
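Each row above is an iteration number, two timings and, in parentheses, the second minus the first. In most runs row 0 is the slowest, which looks like a warm-up effect, so one quick way to compare builds is to average the deltas over rows 1 to 9 only. A small sketch using the `-m32 -O2` run; `struct sample` and `mean_delta()` are hypothetical helpers, not part of perftest:

```c
#include <assert.h>
#include <stddef.h>

/* One "i: first second (delta)" row from the perftest output. */
struct sample {
	int first;
	int second;
};

/*
 * Mean of (second - first), skipping the first `skip` rows (warm-up).
 * Integer division truncates toward zero.
 */
static int mean_delta(const struct sample *s, size_t n, size_t skip)
{
	int sum = 0;

	assert(n > skip);
	for (size_t i = skip; i < n; i++)
		sum += s[i].second - s[i].first;
	return sum / (int)(n - skip);
}

/* The -m32 -O2 run above, rows 0 to 9. */
static const struct sample m32_O2[] = {
	{148, 124}, {48, 36}, {40, 36}, {44, 40}, {44, 40},
	{36, 36},   {52, 36}, {44, 32}, {44, 36}, {48, 36},
};
```

For that run, `mean_delta(m32_O2, 10, 1)` gives -8 (the deltas of rows 1 to 9 sum to -72), versus -9 when row 0 is included.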

Your program looks much more careful about the timing.

A few GCC warnings I ended up fixing:
1) "volatile" on rdtsc is meaningless and ignored (with a warning)
2) fast_mix2() needs a void return type; it defaults to int.
3) int main() needs a "return 0"
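For reference, a minimal sketch of what the first two fixes look like. The names and the pool signature are assumptions (the test program itself isn't reproduced here), and it assumes the `volatile` was on the function's return type. The point is that `volatile` belongs on the `asm` statement, where it keeps GCC from deleting or merging repeated counter reads.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/*
 * Fix 1: put "volatile" on the asm statement, not on the return type.
 * "static inline volatile uint64_t rdtsc(void)" only draws a GCC warning
 * (a qualifier on a return type is ignored).
 */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>
/* Non-x86 fallback (assumption): monotonic nanoseconds, not a TSC. */
static inline uint64_t rdtsc(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/*
 * Fix 2: declare the return type explicitly.  Leaving it off makes this
 * an implicit-int function that never returns a value.  The body is a
 * hypothetical stand-in, not the real fast_mix2().
 */
static void mix_stub(uint32_t pool[4], const uint32_t input[4])
{
	assert(pool && input);
	for (int i = 0; i < 4; i++)
		pool[i] ^= input[i] + (pool[(i + 3) & 3] << 5);
}
```

Fix 3 is the usual one: presumably because GCC 4.9 defaults to gnu90, falling off the end of `int main()` is not an implicit `return 0` as it is in C99, so ending `main` with `return 0;` keeps `-W -Wall` quiet.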

Here's what I got running *your* program, unmodified except
for the above (meaning 3 inner loop iterations).
Compiled with GCC 4.9.0 (Debian 4.9.0-6), -O2.

i7-4940K# ./perftest ./ted32
fast_mix: 430 fast_mix2: 431
fast_mix: 442 fast_mix2: 464
fast_mix: 442 fast_mix2: 465
fast_mix: 442 fast_mix2: 431
fast_mix: 442 fast_mix2: 465
fast_mix: 431 fast_mix2: 430
fast_mix: 442 fast_mix2: 431
fast_mix: 431 fast_mix2: 465
fast_mix: 431 fast_mix2: 465
fast_mix: 431 fast_mix2: 431
i7-4940K# ./perftest ./ted64
fast_mix: 454 fast_mix2: 465
fast_mix: 453 fast_mix2: 465
fast_mix: 442 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 454 fast_mix2: 465
fast_mix: 453 fast_mix2: 465
fast_mix: 442 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 453 fast_mix2: 465

In other words, pretty damn near the same
speed (with 3 loops).

So we still have some discrepancy to track down.

Results from a few other machines:
i5-3330$ /tmp/ted32
fast_mix: 226 fast_mix2: 277
fast_mix: 561 fast_mix2: 429
fast_mix: 156 fast_mix2: 406
fast_mix: 504 fast_mix2: 534
fast_mix: 579 fast_mix2: 270
fast_mix: 240 fast_mix2: 270
fast_mix: 494 fast_mix2: 270
fast_mix: 240 fast_mix2: 138
fast_mix: 750 fast_mix2: 277
fast_mix: 124 fast_mix2: 270
i5-3330$ /tmp/ted64
fast_mix: 224 fast_mix2: 277
fast_mix: 226 fast_mix2: 312
fast_mix: 646 fast_mix2: 276
fast_mix: 233 fast_mix2: 456
fast_mix: 591 fast_mix2: 570
fast_mix: 413 fast_mix2: 563
fast_mix: 584 fast_mix2: 270
fast_mix: 231 fast_mix2: 261
fast_mix: 233 fast_mix2: 459
fast_mix: 528 fast_mix2: 277

Pentium4$ /tmp/ted32
fast_mix: 912 fast_mix2: 396
fast_mix: 792 fast_mix2: 160
fast_mix: 524 fast_mix2: 160
fast_mix: 1460 fast_mix2: 440
fast_mix: 496 fast_mix2: 160
fast_mix: 672 fast_mix2: 160
fast_mix: 700 fast_mix2: 160
fast_mix: 336 fast_mix2: 540
fast_mix: 896 fast_mix2: 160
fast_mix: 1052 fast_mix2: 156

Phenom9850$ /tmp/ted32
fast_mix: 463 fast_mix2: 158
fast_mix: 276 fast_mix2: 174
fast_mix: 194 fast_mix2: 135
fast_mix: 620 fast_mix2: 424
fast_mix: 584 fast_mix2: 424
fast_mix: 610 fast_mix2: 418
fast_mix: 651 fast_mix2: 1107
fast_mix: 634 fast_mix2: 439
fast_mix: 632 fast_mix2: 456
fast_mix: 534 fast_mix2: 205
Phenom9850$ /tmp/ted64
fast_mix: 783 fast_mix2: 185
fast_mix: 903 fast_mix2: 144
fast_mix: 955 fast_mix2: 178
fast_mix: 515 fast_mix2: 437
fast_mix: 642 fast_mix2: 580
fast_mix: 610 fast_mix2: 525
fast_mix: 523 fast_mix2: 119
fast_mix: 180 fast_mix2: 315
fast_mix: 596 fast_mix2: 570
fast_mix: 598 fast_mix2: 775

AthlonXP$ /tmp/ted32
fast_mix: 119 fast_mix2: 113
fast_mix: 139 fast_mix2: 109
fast_mix: 155 fast_mix2: 123
fast_mix: 134 fast_mix2: 140
fast_mix: 126 fast_mix2: 154
fast_mix: 134 fast_mix2: 113
fast_mix: 176 fast_mix2: 140
fast_mix: 145 fast_mix2: 113
fast_mix: 134 fast_mix2: 144
fast_mix: 155 fast_mix2: 112
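With run-to-run swings this large (the i5-3330 ted32 fast_mix figures range from 124 to 750), single samples are hard to compare. One common way to summarize noisy timings is to report the minimum, which approximates the undisturbed cost, next to the median. A sketch applied to the Pentium4 numbers above; `summarize()` and `cmp_int()` are hypothetical helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for ints, written to avoid overflow. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Minimum and median of n samples (1 <= n <= 64); sorts a private copy. */
static void summarize(const int *samples, size_t n, int *min, int *median)
{
	int buf[64];

	assert(n > 0 && n <= 64);
	memcpy(buf, samples, n * sizeof buf[0]);
	qsort(buf, n, sizeof buf[0], cmp_int);
	*min = buf[0];
	/* Even count: mean of the two middle values, truncated. */
	*median = (n % 2) ? buf[n / 2] : (buf[n / 2 - 1] + buf[n / 2]) / 2;
}

/* Pentium4 runs of ted32 above. */
static const int p4_fast_mix[] = {912, 792, 524, 1460, 496, 672, 700, 336, 896, 1052};
static const int p4_fast_mix2[] = {396, 160, 160, 440, 160, 160, 160, 540, 160, 156};
```

On those ten samples this gives a minimum of 336 and a median of 746 for fast_mix, against 156 and 160 for fast_mix2, which suggests the Pentium4 gap is not just noise.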

So I'm still a bit confused. Would any bystanders like to
chip in? Ted, shall I send you some binaries?
