Re: [PATCH 1/4] tracing: move __DO_TRACE out of line

From: Jeremy Fitzhardinge
Date: Sun Apr 19 2009 - 19:39:02 EST


Mathieu Desnoyers wrote:
> Here are the conclusions I gather from the following tbench tests on the
> LTTng tree:
>
> - Dormant tracepoints, when sprinkled all over the place, have a very small,
>   but measurable, footprint on kernel stress-test workloads (3 % for the
>   whole 2.6.30-rc1 LTTng tree).
>
> - "Immediate values" help lessen this impact significantly (3 % -> 2.5 %).
>
> - Static jump patching would diminish the impact even more, but would require
>   gcc modifications to be acceptable. I did some prototypes using instruction
>   pattern matching in the past, which was judged too complex.
>
> - I strongly recommend adding a per-subsystem config-out option for heavy
>   users like kmemtrace or pvops. Compiling out the kmemtrace instrumentation
>   brings the performance impact from 2.5 % down to a 1.9 % slowdown.
>
> - Putting the tracepoint out of line is a no-go, as it slows down *both* the
>   dormant (3 % -> 4.7 %) and the active (+20 % to tracer overhead) tracepoints
>   compared to inline tracepoints.
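
For reference, the per-subsystem config-out Mathieu suggests would presumably boil down to something like the stub below (the config symbol and the trace_kmalloc() signature are made up for illustration, not lifted from the tree):

#include <linux/tracepoint.h>

#ifdef CONFIG_KMEMTRACE
/* Real tracepoint: every call site keeps the dormant state check. */
DECLARE_TRACE(kmalloc,
	TP_PROTO(unsigned long call_site, const void *ptr, size_t bytes),
	TP_ARGS(call_site, ptr, bytes));
#else
/* Configured out: calls compile down to nothing, so even the dormant
 * test-and-branch disappears from the allocator fast path. */
static inline void trace_kmalloc(unsigned long call_site, const void *ptr,
				 size_t bytes)
{
}
#endif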

That's an interestingly counter-intuitive result. Do you have any theories about how this might happen? The only mechanism I can think of is that, because the inlined code sections are smaller, gcc is less inclined to put the if (unlikely()) code out of line, so the amount of hot-path code is higher. But still, 1.7% is a massive increase in overhead, especially compared to the relative differences of the other changes.
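
To make that hand-waving a bit more concrete, the contrast I have in mind is roughly the following (a simplified sketch with invented names, not the actual __DO_TRACE or LTTng code):

struct tracepoint_foo {
	int state;			/* non-zero while a probe is attached */
	void (**funcs)(int);		/* NULL-terminated probe array */
};
extern struct tracepoint_foo __tracepoint_foo;

void __do_trace_foo(int arg);		/* out-of-line body that walks the probes */

/* Out-of-line flavour: the unlikely()-guarded block is a single call,
 * small enough that gcc may well leave it sitting in the hot path. */
static inline void trace_foo_outofline(int arg)
{
	if (__builtin_expect(__tracepoint_foo.state, 0))
		__do_trace_foo(arg);
}

/* Inline flavour: the guarded block is the whole probe-walking loop;
 * being bulky and marked unlikely, gcc seems more inclined to sink it
 * to the end of the function, leaving just a test and branch inline. */
static inline void trace_foo_inline(int arg)
{
	if (__builtin_expect(__tracepoint_foo.state, 0)) {
		void (**it)(int) = __tracepoint_foo.funcs;

		while (it && *it)
			(*it++)(arg);
	}
}

If that's what is going on, the dormant cost difference would be about code layout and I-cache footprint rather than about the check itself.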

> Tracepoints all compiled out:
>
> run 1 : 2091.50
> run 2 (after reboot) : 2089.50 (baseline)
> run 3 (after reboot) : 2083.61
>
> Dormant tracepoints:
>
> inline, no immediate value optimization
>
> run 1 : 1990.63
> run 2 (after reboot) : 2025.38 (3 %)
> run 3 (after reboot) : 2028.81
>
> out-of-line, no immediate value optimization
>
> run 1 : 1990.66
> run 2 (after reboot) : 1990.19 (4.7 %)
> run 3 (after reboot) : 1977.79
>
> inline, immediate value optimization
>
> run 1 : 2035.99 (2.5 %)
> run 2 (after reboot) : 2036.11
> run 3 (after reboot) : 2035.75
>
> inline, immediate value optimization, configuring out kmemtrace tracepoints
>
> run 1 : 2048.08 (1.9 %)
> run 2 (after reboot) : 2055.53
> run 3 (after reboot) : 2046.49

So what are you doing here? Are you doing 3 runs, then comparing the median measurement in each case?
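
(For what it's worth, the quoted percentages do look like median-vs-median comparisons against the 2089.50 baseline: (2089.50 - 2025.38) / 2089.50 is about 3.1 %, (2089.50 - 1990.19) / 2089.50 about 4.8 %, (2089.50 - 2035.99) / 2089.50 about 2.6 %, and (2089.50 - 2048.08) / 2089.50 about 2.0 %. But that's just my back-of-the-envelope reconstruction.)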

The trouble is that your run-to-run variations are at least as large as the difference you're trying to detect. For example, in run 1 of "inline, no immediate value optimization" you got 1990.6 MB/s of throughput, and then runs 2 and 3 both went up to ~2025. Why? That's a huge jump.

The "out-of-line, no immediate value optimization" runs 1 and 2 have the same throughput as run 1 of the previous test, 1990 MB/s, while run 3 is a bit worse. OK, so perhaps it is slower. But why are runs 1 and 2 more or less identical to inline/run 1?

What would happen if you did 10 iterations of these tests? There just seems to be too much run-to-run variation for 3 runs to be statistically meaningful.

I'm not picking on you personally, because I had exactly the same problems when trying to benchmark the overhead of pvops. The reboot/rerun variations were at least as large as the effects I was trying to measure, and I'm left feeling suspicious of all the results.

I think there's something fundamentally off about this kind of kernel benchmark methodology. The results are not stable and are not - I think - reliable. Unfortunately I don't have enough of a background in statistics to really analyze what's going on here, or to say how we should change the test/measurement methodology to get results that we can really stand by.
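
I'm no statistician either, but even something as crude as ten runs per configuration plus a mean/standard-deviation comparison would tell us whether a 1-2% delta stands out from the noise. Here's a rough, standalone sketch of the kind of check I mean (nothing kernel-specific, just a toy; feed the throughput numbers in on the command line):

/* Toy run-to-run noise check: give it one set of throughput numbers,
 * then "--", then a second set; it prints mean and standard deviation
 * for each and says whether the difference in means clears roughly
 * twice the combined standard error.  Purely illustrative. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXRUNS 64

static void stats(const double *v, int n, double *mean, double *sd)
{
	double sum = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += v[i];
	*mean = sum / n;
	for (i = 0; i < n; i++)
		var += (v[i] - *mean) * (v[i] - *mean);
	*sd = sqrt(var / (n - 1));
}

int main(int argc, char **argv)
{
	double a[MAXRUNS], b[MAXRUNS], ma, sa, mb, sb, delta, se;
	int na = 0, nb = 0, second = 0, i;

	for (i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "--"))
			second = 1;
		else if (second && nb < MAXRUNS)
			b[nb++] = atof(argv[i]);
		else if (!second && na < MAXRUNS)
			a[na++] = atof(argv[i]);
	}
	if (na < 2 || nb < 2) {
		fprintf(stderr, "usage: %s base-runs... -- test-runs...\n", argv[0]);
		return 1;
	}

	stats(a, na, &ma, &sa);
	stats(b, nb, &mb, &sb);
	delta = ma - mb;
	se = sqrt(sa * sa / na + sb * sb / nb);	/* std error of the difference */

	printf("baseline: %.2f +- %.2f over %d runs\n", ma, sa, na);
	printf("test:     %.2f +- %.2f over %d runs\n", mb, sb, nb);
	printf("delta:    %.2f (%.2f%%), noise floor ~%.2f: %s\n",
	       delta, 100.0 * delta / ma, 2.0 * se,
	       fabs(delta) > 2.0 * se ? "looks real" : "lost in the noise");
	return 0;
}

e.g. "./noisecheck 2091.50 2089.50 2083.61 -- 2025.38 2028.81 1990.63". With only three runs per side the standard deviations themselves are barely estimable, which is really the point: we need more iterations before the comparison means anything.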

I don't even have a good explanation for why there are such large boot-to-boot variations in the first place. The normal explanation is "cache effects", but what is actually changing here? The kernel image is identical, loaded into the same physical pages each time, and mapped at the same virtual address, so the I&D caches and TLB should see exactly the same access patterns for the kernel code itself. The dynamically allocated memory will vary, and will have different cache interactions, but is that enough to explain variations of this size? If so, we're going to need to do a lot more iterations to see any signal from our actual changes over the noise that "cache effects" are throwing our way...

J