Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate
From: Jin, Yao
Date: Mon Jun 19 2017 - 21:25:52 EST
Ok, thanks for making this per-arch! Some comments:
I think we should have this marked permanently, i.e. not just when we go
to the jump line, something like this (testing here in a t450s
broadwell, function hc_find_func, /usr/lib64/liblzma.so.5.2.2):
It is like this now, when we are not on the jne jump line:
0.71 â mov %r14d,%r10d â
â movzbl (%rdx,%r10,1),%ebp â
1.06 â 70: mov (%r9,%rcx,4),%ecx â
77.98 â 74: cmp %bpl,(%rbx,%r10,1) â
â â jne 70 â
0.85 â movzbl (%rdx),%r10d â
0.99 â cmp %r10b,(%rbx) â
I think it should be augmented to:
0.71 â mov %r14d,%r10d â
â movzbl (%rdx,%r10,1),%ebp â
1.06 â 70: ââmov (%r9,%rcx,4),%ecx â
77.98 â 74: ââcmp %bpl,(%rbx,%r10,1) â
â â jne 70 â
0.85 â movzbl (%rdx),%r10d â
0.99 â cmp %r10b,(%rbx) â
I.e. no arrow, the two instructions that end up as one micro-op being
connected.
The fused instruction pairs are:
cmp + jcc
test + jcc
add + jcc
sub + jcc
and + jcc
inc + jcc
dec + jcc
Mov and cmp are not the fused instruction pair. So we don't need to
connect mov and cmp. I guess what Arnaldo wants is to connect two fused
instructions even we don't go to the jcc line. For example: a line is
connected between cmp and jne in above case.
I have thought about that. While the visualization may be not very good
because the original arrow before jne would be overwritten. So now I
just implement a way that joins the jump arrow when we go to the jcc
line. Another consideration is the fused instruction pairs are very
common instructions in code, if we mark them all, there may be too much.
And then this:
â âââcmpl $0x0,argp_program_version_hook
81.93 â âââje 20
â â lock cmpxchg %esi,0x38a9a4(%rip)
â ââ jne 29
â ââ jmp 43
11.47 â20:âââcmpxch %esi,0x38a999(%rip)
Would look better as:
â âââcmpl $0x0,argp_program_version_hook
81.93 â âââje 20
â â lock cmpxchg %esi,0x38a9a4(%rip)
â ââ jne 29
â ââ jmp 43
11.47 â20:âââcmpxch %esi,0x38a999(%rip)
Patch below, please test/ack :-)
I have tested. It's better! There is no space in the line. Thanks!
This was the low hanging fruit, having the:
1.06 â 70: ââmov (%r9,%rcx,4),%ecx â
77.98 â 74: ââcmp %bpl,(%rbx,%r10,1) â
Marker always there, not just when we have the cursor on top of one of
those lines remains to be coded.
My comment is as above.
But you state:
------------
Macro fusion merges two instructions to a single micro-op. Intel core
platform performs this hardware optimization under limited
circumstances.
------------
"Intel core", what about older arches, etc, don't you have to look at:
# cpudesc : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
# cpuid : GenuineIntel,6,61,4
present in the perf.data header (or in the running system, for things
like 'perf top') to make sure that this is a machine where such "macro
fusion" takes place?
- Arnaldo
Reference for macro fusion is the optimization guide,
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
2.3.2.1
â In Intel microarchitecture code name Nehalem: CMP, TEST.
â In Intel microarchitecture code name Sandy Bridge: CMP, TEST, ADD,
SUB, AND, INC, DEC
â These instructions can fuse if The first source / destination operand
is a register.
The second source operand (if exists) is one of: immediate, register, or
non RIP-relative memory.
The second instruction of the macro-fusable pair is a conditional branch.
We probably don't need the full rules, just a simple test for
CMP/TEST/ADD/SUB/AND/INC/DEC and second instruction a Jcc condition
branch. Also I don't think we need to distinguish Nehalem/Sandy Bridge
and other core platforms. A simple test may be acceptable.
Thanks!
Jin Yao
diff --git a/tools/perf/ui/browser.c b/tools/perf/ui/browser.c
index acba636bd165..9ef7677ae14f 100644
--- a/tools/perf/ui/browser.c
+++ b/tools/perf/ui/browser.c
@@ -756,8 +756,10 @@ void ui_browser__mark_fused(struct ui_browser *browser, unsigned int column,
ui_browser__gotorc(browser, end_row, column);
SLsmg_draw_hline(2);
ui_browser__gotorc(browser, end_row + 1, column - 1);
- SLsmg_draw_vline(1);
+ SLsmg_write_char(SLSMG_LTEE_CHAR);
} else {
+ ui_browser__gotorc(browser, end_row, column - 1);
+ SLsmg_write_char(SLSMG_LTEE_CHAR);
ui_browser__gotorc(browser, end_row, column);
SLsmg_draw_hline(2);
}