Re: [patch] Performance Counters for Linux, v2
From: Ingo Molnar
Date: Tue Dec 09 2008 - 08:47:24 EST
* stephane eranian <eranian@xxxxxxxxxxxxxx> wrote:
> > There's a new "counter group record" facility that is a
> > straightforward extension of the existing "irq record" notification
> > type. This record type can be set on a 'master' counter, and if the
> > master counter triggers an IRQ or an NMI, all the 'secondary'
> > counters are read out atomically and are put into the counter-group
> > record. The result can then be read() out by userspace via a single
> > system call. (Based on extensive feedback from Paul Mackerras and
> > David Miller, thanks guys!)
>
> That is unfortunately not generic enough. You need a bit more
> flexibility than master/secondaries, I am afraid. What tools want is
> to be able to express:
>
> - when event X overflows, record values of events J, K
> - when event Y overflows, record values of events Z, J
hm, the new group code in perfcounters-v2 can already do this. Have you
tried to use it and found that it didn't work? If so then that's a bug.
Nothing in the design prevents that kind of group readout.
[ We could (and probably will) enhance the grouping relationship some
more, but group readouts are a fundamentally inferior mode of
profiling. (see below for the explanation) ]
> I am not making this up. I know tools that do just that, i.e., that is
> collecting two distinct profiles in a single run. This is how, for
> instance, you can collect a flat profile and the call graph in one run,
> very much like gprof.
yeah, but it's still the fundamentally wrong thing to do.
Being able to extract high-quality performance information from the
system is the cornerstone of our design, and choosing the right sampling
model permeates the whole issue of single-counter versus group-readout.
I don't think finer design aspects of kernel support for performance
counters can be argued without being on the same page about this, so
please let me outline our view on these things, in (boringly) verbose
detail - spiked with examples and code as well.
Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
wrong mindset - and cross-sampling counters is a similarly wrong mindset.
When there are two (or more) hw metrics to profile, the best (i.e. the
statistically most stable and most relevant) sampling for the two
statistical variables (say l2_misses versus l2_accesses) is to sample
them independently, each via its own metric - not via a static 1 khz
rate, and not by picking one of the variables to generate the samples.
[ Sidenote: this assumes that the hw supports such independent sampling.
Not all CPUs are capable of that, but most modern CPUs are - so let's
assume so for the sake of argument. ]
Static frequency [time] sampling has a number of disadvantages that
drastically reduce its precision and reduce its utility, and 'group'
sampling where one counter controls the events has similar problems:
- It under-samples rare events such as cachemisses.
An example: say we have a workload that executes 1 billion instructions
a second, of which 5000 generate a cachemiss. Only one in 200,000
instructions generates a cachemiss. With 1000 static sampling IRQs a
second, the chance that one of them hits exactly an instruction that
causes a cachemiss is 1:200 (0.5%) in every second. That is a very low
probability, and the profile would not be very helpful - even though it
samples at a seemingly adequate frequency of 1000 samples per second!
With per event counters and per event sampling that KernelTop uses, we
get an event next to the instruction that causes a cachemiss with a
100% certainty, all the time. The profile and its per instruction
aspects suddenly become a whole lot more accurate and whole lot more
interesting.
- Static frequency and group sampling also run the risk of systematic
error/skew of sampling if any workload component has any correlation
with the "1msec" global sampling period.
For example: say we profile a workload that runs a timer every 20
msecs. In such a case the profile could be skewed asymmetrically
against [or in favor of] the timer activity that it does every 20
milliseconds.
Good sampling wants the samples to be generated in proportion to the
variable itself, not proportional to absolute time.
- Static sampling also over-samples when the workload activity goes
down (when it goes more idle).
For example: we profile a fluctuating workload that is sometimes only
0.2% busy, i.e. running only for 2 milliseconds every second. Still we
keep interrupting it at 1 khz - that can be a very brutal systematic
skew if the sampling overhead is 2 microseconds, totalling to 2 msecs
overhead every second - so 50% of what runs on the CPU will be sampling
code - impacting/skewing the sampled code.
Good sampling wants to 'follow' the ebb and flow of the actual hw
events that the CPU has.
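
[ Illustration: the arithmetic in the three points above can be re-checked
with a few lines of Python. This is only a sketch under stated
assumptions: the function names are hypothetical, samples are assumed to
land uniformly on instructions, and both the 997 usec period and the
500..799 usec activity window are arbitrary values picked for contrast. ]

```python
from fractions import Fraction


def expected_hits_per_sec(sample_hz, events_per_sec, insns_per_sec):
    """Expected fixed-rate samples per second that land exactly on an
    event-causing instruction (assumes uniformly distributed samples)."""
    return sample_hz * Fraction(events_per_sec, insns_per_sec)


def sampling_share(sample_hz, overhead_usec, busy_usec_per_sec):
    """Share of non-idle CPU time that is spent in the sampling code."""
    sampling_usec = sample_hz * overhead_usec
    return Fraction(sampling_usec, sampling_usec + busy_usec_per_sec)


def samples_in_window(sample_period_usec, timer_period_usec,
                      win_start_usec, win_len_usec, n_samples):
    """Count samples that land inside a periodic activity window, i.e. at
    offsets [win_start, win_start + win_len) within each timer period."""
    hits = 0
    for i in range(n_samples):
        offset = (i * sample_period_usec) % timer_period_usec
        if win_start_usec <= offset < win_start_usec + win_len_usec:
            hits += 1
    return hits


# Under-sampling: 1 khz sampling, 5000 misses per 1 billion instructions.
print(expected_hits_per_sec(1000, 5000, 1_000_000_000))   # 1/200

# Over-sampling: 1 khz, 2 usecs per sample, workload busy 2 msecs/sec.
print(sampling_share(1000, 2, 2000))                      # 1/2

# Skew: hypothetical timer work occupying usecs 500..799 of every 20 msec
# period. A 1 msec sampling period never sees it; a non-aligned one does.
print(samples_in_window(1000, 20_000, 500, 300, 20_000))  # 0
print(samples_in_window(997, 20_000, 500, 300, 20_000))   # 300
```

The last two lines show the "against [or in favor of]" skew: with the
aligned period the timer work is invisible, while the non-aligned period
sees it in 300 of 20,000 samples (1.5%), which matches its real share of
time. Shifting the window to start at offset 0 would instead make the
aligned sampler over-report it.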
The best way to sample two metrics such as "cache accesses" and "cache
misses" (or say "cache misses" versus "TLB misses") is to sample the two
variables _independently_, and to build independent histograms out of
them.
The combination (or 'grouping') of the measured variables is thus done at
the output stage _after_ data acquisition, to provide a weighted
histogram (or a split-view double histogram).
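
[ Illustration: a toy software model of this acquisition stage, assuming
each counter simply raises a notification every N of its own events and
records the PC at which it fired. This is not taken from kerneltop.c or
from the kernel code; the class, the function, the event stream and the
periods are all hypothetical. ]

```python
from collections import Counter


class PeriodicCounter:
    """Toy counter that raises a notification every `period` events."""

    def __init__(self, period):
        self.period = period
        self.pending = 0            # events seen since last notification
        self.histogram = Counter()  # pc -> number of notifications

    def event(self, pc):
        self.pending += 1
        if self.pending == self.period:
            self.pending = 0
            self.histogram[pc] += 1


def profile(stream, periods):
    """Feed (pc, kind) events to one independent counter per kind and
    return one histogram per kind."""
    counters = {kind: PeriodicCounter(p) for kind, p in periods.items()}
    for pc, kind in stream:
        counters[kind].event(pc)
    return {kind: c.histogram for kind, c in counters.items()}


# Hypothetical workload: pc 0x10 does 3 cache accesses, pc 0x20 does one
# access that also misses. Repeat 300 times.
iteration = [(0x10, "access")] * 3 + [(0x20, "access"), (0x20, "miss")]
hists = profile(iteration * 300, {"access": 3, "miss": 2})

print(dict(hists["access"]))  # {16: 300, 32: 100}
print(dict(hists["miss"]))    # {32: 150}
```

Each histogram is driven only by its own event type, so the access
histogram keeps the 3:1 access ratio of the two PCs while the miss
histogram contains only the PC that actually misses.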
For example, in a "l2 misses" versus "l2 accesses" case, the highest
quality of sampling is to use two independent sampling IRQs with such
sampling parameters:
- one notification every 200 L2 cache misses
- one notification every 10,000 L2 cache accesses
[ this is a ballpark figure - the sample rate is a function of the
averages of the workload and the characteristics of the CPU. ]
And at the output stage display a combination of:
l2_accesses[pc]
l2_misses[pc]
l2_misses[pc] / l2_accesses[pc]
Note that if we had a third variable as well - say icache_misses[], we
could combine the three metrics:
l2_misses[pc] / l2_accesses[pc] / icache_misses[pc]
( such a view expresses the miss/access ratio in a branch-weighted
fashion: it weighs down instructions that also show signs of icache
pressure and goes for the functions with a high dcache miss rate but low
icache pressure - i.e. commonly executed functions with a high data
miss rate. )
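
[ Illustration: one way the output-stage combination could be computed.
The exact weighting that kerneltop uses is not spelled out here, so this
sketch assumes the simplest form: scale every sample count by its
sampling period to estimate event totals, then divide per PC. The
function names and the sample counts are hypothetical. ]

```python
def estimated_events(histogram, period):
    """Estimate real event counts: each sample stands for `period` events."""
    return {pc: n * period for pc, n in histogram.items()}


def miss_ratio(miss_hist, miss_period, access_hist, access_period):
    """Per-PC misses/accesses, built from two independent histograms."""
    misses = estimated_events(miss_hist, miss_period)
    accesses = estimated_events(access_hist, access_period)
    ratio = {}
    for pc, m in misses.items():
        a = accesses.get(pc, 0)
        if a:  # no access samples at this PC: the ratio is undefined
            ratio[pc] = m / a
    return ratio


def top(ratio, n=10):
    """Highest ratios first, like a 'weight' column."""
    return sorted(ratio.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical sample counts, using the 200 / 10,000 periods from above.
miss_hist = {0xa: 50, 0xb: 5}
access_hist = {0xa: 100, 0xb: 1, 0xc: 7}
ranking = top(miss_ratio(miss_hist, 200, access_hist, 10_000))
for pc, weight in ranking:
    print(hex(pc), weight)
```

In this made-up data PC 0xa has ten times more miss samples than 0xb, yet
0xb ranks first because its misses are spread over far fewer accesses.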
Sampling at a static frequency is acceptable as well in some cases, and
will lead to an output that is usable for some things. It's just not the
best sampling model, and it's not usable at all for certain important
things such as highly derived views, good instruction level profiles or
rare hw events.
I've uploaded a new version of kerneltop.c that has such a multi-counter
sampling model that follows this statistical model:
http://redhat.com/~mingo/perfcounters/kerneltop.c
Example of usage:
I've started a tbench 64 localhost workload on a 16way x86 box. I want to
check the miss/refs ratio. I first sampled one of the metrics,
cache-references:
$ ./kerneltop -e 2 -c 10000 -C 2
------------------------------------------------------------------------------
KernelTop: 1311 irqs/sec [NMI, 10000 cache-refs], (all, cpu: 2)
------------------------------------------------------------------------------
 events         RIP          kernel function
 ______   ________________   _______________
5717.00 - ffffffff803666c0 : copy_user_generic_string!
 355.00 - ffffffff80507646 : tcp_sendmsg
 315.00 - ffffffff8050abcb : tcp_ack
 222.00 - ffffffff804fbb20 : ip_rcv_finish
 215.00 - ffffffff8020a75b : __switch_to
 194.00 - ffffffff804d0b76 : skb_copy_datagram_iovec
 187.00 - ffffffff80502b5d : __inet_lookup_established
 183.00 - ffffffff8051083d : tcp_transmit_skb
 160.00 - ffffffff804e4fc9 : eth_type_trans
 156.00 - ffffffff8026ae31 : audit_syscall_exit
Then i checked the characteristics of the other metric [cache-misses]:
$ ./kerneltop -e 3 -c 200 -C 2
------------------------------------------------------------------------------
KernelTop: 1362 irqs/sec [NMI, 200 cache-misses], (all, cpu: 2)
------------------------------------------------------------------------------
 events         RIP          kernel function
 ______   ________________   _______________
1419.00 - ffffffff803666c0 : copy_user_generic_string!
1075.00 - ffffffff804e4fc9 : eth_type_trans
1059.00 - ffffffff804d8baa : dst_release
 949.00 - ffffffff80510004 : tcp_established_options
 841.00 - ffffffff804fbb20 : ip_rcv_finish
 569.00 - ffffffff804ce808 : skb_push
 454.00 - ffffffff80502b5d : __inet_lookup_established
 453.00 - ffffffff805001a3 : ip_queue_xmit
 298.00 - ffffffff804cf5d8 : skb_release_head_state
 247.00 - ffffffff804ce74b : skb_copy_and_csum_dev
then, to get the "combination" view of the two counters, i appended the
two command lines:
$ ./kerneltop -e 3 -c 200 -e 2 -c 10000 -C 2
------------------------------------------------------------------------------
KernelTop: 2669 irqs/sec [NMI, cache-misses/cache-refs], (all, cpu: 2)
------------------------------------------------------------------------------
 weight         RIP          kernel function
 ______   ________________   _______________
  35.20 - ffffffff804ce74b : skb_copy_and_csum_dev
  33.00 - ffffffff804cb740 : sock_alloc_send_skb
  31.26 - ffffffff804ce808 : skb_push
  22.43 - ffffffff80510004 : tcp_established_options
  19.00 - ffffffff8027d250 : find_get_page
  15.76 - ffffffff804e4fc9 : eth_type_trans
  15.20 - ffffffff804d8baa : dst_release
  14.86 - ffffffff804cf5d8 : skb_release_head_state
  14.00 - ffffffff802217d5 : read_hpet
  12.00 - ffffffff804ffb7f : __ip_local_out
  11.97 - ffffffff804fc0c8 : ip_local_deliver_finish
   8.54 - ffffffff805001a3 : ip_queue_xmit
[ It's interesting to see that a seemingly common function,
copy_user_generic_string(), got eliminated from the top spots - because
there are other functions whose relative cachemiss rate is far more
serious. ]
The above "derived" profile output is relatively stable under kerneltop
with the use of ~2600 sample irqs/sec and the 2 seconds default refresh.
I'd encourage you to try to achieve the same quality of output with
static 2600 hz sampling - it won't work with the kind of event rates i've
worked with above, no matter whether you read out a single counter or a
group of counters, atomically or not. (because we just don't get
notification PCs at the relevant hw events - we get PCs with a time
sample)
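
[ Illustration: a deterministic toy simulation of that difference. It
assumes one instruction per time unit, a loop of 1000 instructions, and a
single instruction (pc 123) that misses the cache on every pass. All of
these numbers, and the function itself, are hypothetical and only meant
to show which PCs each sampling model records. ]

```python
from collections import Counter


def simulate(n_insns, loop_len, miss_pc, time_period, miss_period):
    """Run a toy loop of `loop_len` instructions for `n_insns` steps.

    Returns (time_hist, event_hist): PCs recorded by a fixed-period
    "timer" sample and by a notification every `miss_period` misses.
    """
    time_hist, event_hist = Counter(), Counter()
    misses = 0
    for i in range(n_insns):
        pc = i % loop_len
        # Fixed-rate sampling: fires regardless of what the CPU is doing.
        if (i + 1) % time_period == 0:
            time_hist[pc] += 1
        # Event sampling: fires at the instruction that causes the miss.
        if pc == miss_pc:
            misses += 1
            if misses % miss_period == 0:
                event_hist[pc] += 1
    return time_hist, event_hist


time_hist, event_hist = simulate(1_000_000, 1000, 123, 997, 10)
print(sum(time_hist.values()), time_hist[123])    # 1003 1
print(sum(event_hist.values()), event_hist[123])  # 100 100
```

Of the 1003 time-driven samples only one lands on the missing
instruction, while every one of the 100 event-driven notifications
points at it.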
And that is just one 'rare' event type (cachemisses) - if we had two such
sources (say l2 cachemisses and TLB misses) then such type of combined
view would only be possible if we got independent events from both
hardware events.
And note that once you accept that the highest quality approach is to
sample the hw events independently, all the "group readout" approaches
become a second-tier mechanism. KernelTop uses that model and works just
fine without any group readout and it is making razor sharp profiles,
down to the instruction level.
[ Note that there are special cases where group-sampling can limp along
with acceptable results: if one of the two counters has so many events
that sampling by time or sampling by the rare event type gives relevant
context info. But the moment both event sources are rare, the group
model breaks down completely and produces meaningless results. It's
just a fundamentally wrong kind of abstraction to mix together
unrelated statistical variables. And that's one of the fundamental
design problems i see with perfmon-v3. ]
Ingo