On 2/7/2010 1:14 PM, Mike Galbraith wrote:
Disregard case 2 - was missing -O3. With -O3 or -O2, rep;nop and pause are identical. The interesting case is rep;pause, which is different and seems more efficient.
This got me thinking... and testing... I think there's an optimization issue with gcc:
First of all - a bit of background on how I got here:
After reading the Intel documentation, I tried replacing rep;nop with pause (in theory exactly what's shown above). The system hung on boot.
I then tried replacing only the nop with pause (giving rep;pause), and the system booted. Using the above example, the opcode becomes f3 f3 90 vs. f3 90 (rep;nop).
Given the above compiler test case, this seemed odd, to say the least, so I played a bit more with gcc. It seems that the optimizer (-O3) is handling the *three* cases differently (objdump output below).
Base code for all three cases (the only change is the asm volatile line, as shown for each case):
static inline void pause(void)
{
	asm volatile("pause" ::: "memory");
}

void main(void)
{
	pause();
}
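A minimal sketch of one way to build the three variants from a single file, so that only the instruction string changes between builds. PAUSE_CASE, PAUSE_INSN, pause_hint() and spin_hint() are hypothetical names made up for illustration and are not part of the original test; the non-x86 fallback is an assumption too, and is only a compiler barrier.

```c
/*
 * Sketch only: choose the instruction under test at compile time.
 * PAUSE_CASE is a hypothetical macro, not part of the original test.
 */
#ifndef PAUSE_CASE
#define PAUSE_CASE 1			/* default to case 1 */
#endif

#if PAUSE_CASE == 1
#define PAUSE_INSN "pause"		/* case 1 */
#elif PAUSE_CASE == 2
#define PAUSE_INSN "rep;nop"		/* case 2 */
#else
#define PAUSE_INSN "rep;pause"		/* case 3 (and any other value) */
#endif

static inline void pause_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__(PAUSE_INSN ::: "memory");
#else
	/* Assumption: on non-x86 targets emit a compiler barrier only. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Issue the hint n times and return how many times it was issued. */
unsigned spin_hint(unsigned n)
{
	unsigned i;

	for (i = 0; i < n; i++)
		pause_hint();
	return i;
}
```

Building each variant with something like gcc -O2 -DPAUSE_CASE=3 -c and then running objdump -d on the object file should allow the same comparison as the cases that follow.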
Case1 - asm volatile("pause" ::: "memory");
0000000000400480 <main>:
400480: f3 90 pause
400482: c3 retq
400483: 90 nop
Case2 - asm volatile("rep;nop" ::: "memory"); Note: this didn't inline!
0000000000400474 <pause>:
400474: 55 push %rbp
400475: 48 89 e5 mov %rsp,%rbp
400478: f3 90 pause
40047a: c9 leaveq
40047b: c3 retq
000000000040047c <main>:
40047c: 55 push %rbp
40047d: 48 89 e5 mov %rsp,%rbp
400480: e8 ef ff ff ff callq 400474 <pause>
400485: c9 leaveq
400486: c3 retq
400487: 90 nop
400488: 90 nop
400489: 90 nop
40048a: 90 nop
40048b: 90 nop
40048c: 90 nop
40048d: 90 nop
40048e: 90 nop
40048f: 90 nop
Case3 - asm volatile("rep;pause" ::: "memory");
0000000000400480 <main>:
400480: f3 f3 90 pause
400483: c3 retq
400484: 90 nop
_______
Note the difference between the opcodes in case 1 and case 3 (f3 90 vs. f3 f3 90), and the mess made by the compiler in case 2.
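To make the byte-level difference concrete, here is a small sketch that counts the leading f3 prefix bytes in the two encodings. The byte arrays are copied from the objdump output in this message; the names enc_case1, enc_case3, count_rep_prefixes() and is_prefixed_nop() are hypothetical helpers for illustration. It only inspects bytes and says nothing about how a CPU treats the extra prefix.

```c
#include <stddef.h>

/* Byte sequences copied from the objdump output in this message. */
static const unsigned char enc_case1[] = { 0xf3, 0x90 };	/* "pause" */
static const unsigned char enc_case3[] = { 0xf3, 0xf3, 0x90 };	/* "rep;pause" */

/* Count the leading 0xf3 prefix bytes in an encoding. */
size_t count_rep_prefixes(const unsigned char *code, size_t len)
{
	size_t n = 0;

	while (n < len && code[n] == 0xf3)
		n++;
	return n;
}

/*
 * Nonzero if the bytes are one or more 0xf3 prefixes followed by
 * exactly one 0x90 (the nop opcode) and nothing else.
 */
int is_prefixed_nop(const unsigned char *code, size_t len)
{
	size_t n = count_rep_prefixes(code, len);

	return n >= 1 && n + 1 == len && code[n] == 0x90;
}
```

With these arrays, count_rep_prefixes() gives 1 for case 1 and 2 for case 3, and both match the "prefix(es) plus 0x90" shape, which fits objdump printing both as pause.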
As to benchmarks - I've checked a few things, no formal or lasting stuff... but striking at first glance:
1) At idle, perf top shows time spent in _raw_spin_lock dropping from ~35% to ~25%.
2) Running a media transcode (single core - handbrakecli): frame rate increased by about 5-10%.
3) During file-intensive operations (#2, above, or copying large files - ext4 on software raid6) - latencytop shows a decrease in the time to write a page to disk from about 120ms to about 90ms.
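For context on where the instruction sits in a spin-wait, here is a rough userspace sketch of a test-and-set lock that issues the hint while waiting. It is NOT the kernel's _raw_spin_lock; demo_spinlock, cpu_relax_hint() and the demo_spin_* functions are hypothetical names, and it assumes a C11 compiler with <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical demo type; not the kernel's spinlock. */
typedef struct {
	atomic_flag locked;
} demo_spinlock;

/* Spin-wait hint: "pause" on x86, compiler barrier elsewhere (assumption). */
static inline void cpu_relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause" ::: "memory");
#else
	__asm__ __volatile__("" ::: "memory");
#endif
}

static inline void demo_spin_init(demo_spinlock *l)
{
	atomic_flag_clear(&l->locked);
}

/* Try once; returns true if the lock was taken. */
static inline bool demo_spin_trylock(demo_spinlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

/* Spin until the lock is taken, issuing the hint between attempts. */
static inline void demo_spin_lock(demo_spinlock *l)
{
	while (!demo_spin_trylock(l))
		cpu_relax_hint();
}

static inline void demo_spin_unlock(demo_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Swapping the instruction string in cpu_relax_hint() is the userspace analogue of the change tested above; whether it reproduces any of the numbers listed here is untested.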