Re: [PATCH 2/3] perf metricgroup: Fix uncore metric expressions

From: Ian Rogers
Date: Wed Sep 09 2020 - 23:41:15 EST


On Wed, Sep 9, 2020 at 10:52 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> On Wed, Sep 9, 2020 at 2:53 AM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >
> > Hi Ian,
> >
> > On Wed, Sep 9, 2020 at 5:02 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> > >
> > > A metric like DRAM_BW_Use has on SkylakeX events uncore_imc/cas_count_read/
> > > and uncore_imc/case_count_write/. These events open 6 events per socket
> > > with pmu names of uncore_imc_[0-5]. The current metric setup code in
> > > find_evsel_group assumes one ID will map to 1 event to be recorded in
> > > metric_events. For events with multiple matches, the first event is
> > > recorded in metric_events (avoiding matching >1 event with the same
> > > name) and the evlist_used updated so that duplicate events aren't
> > > removed when the evlist has unused events removed.
> > >
> > > Before this change:
> > > $ /tmp/perf/perf stat -M DRAM_BW_Use -a -- sleep 1
> > >
> > > Performance counter stats for 'system wide':
> > >
> > > 41.14 MiB uncore_imc/cas_count_read/
> > > 1,002,614,251 ns duration_time
> > >
> > > 1.002614251 seconds time elapsed
> > >
> > > After this change:
> > > $ /tmp/perf/perf stat -M DRAM_BW_Use -a -- sleep 1
> > >
> > > Performance counter stats for 'system wide':
> > >
> > > 157.47 MiB uncore_imc/cas_count_read/ # 0.00 DRAM_BW_Use
> >
> > Hmm.. I guess the 0.00 result is incorrect, no?
>
> Agreed. There are a number of pre-existing bugs in this code. I'll try
> to look into this one.

There was a bug where the metric_leader wasn't being set correctly and
so the accumulation of values wasn't working as expected. There's a
small fix in:
https://lore.kernel.org/lkml/20200910032632.511566-3-irogers@xxxxxxxxxx/T/#u
that also does the change I mentioned below.

The metric still remains at 0.0 following this change as I believe
there is a bug in the metric. The metric expression is:
"( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) /
1000000000 ) / duration_time"
ie the counts of uncore_imc/cas_count_read/ and
uncore_imc/cas_count_write/ are being first being scaled up by 64,
that is to turn a count of cache lines into bytes, the count is then
divided by 1000000000 or a GB to supposedly give GB/s. However, the
counts are read and scaled:

...
void perf_stat__update_shadow_stats(struct evsel *counter, u64 count,
...
count *= counter->scale;
...

The scale value from
/sys/devices/uncore_imc_0/events/cas_count_read.scale is
6.103515625e-5 or 64/(1024*1024). This means the result of the
expression is 64 times too large but in PB/s and so too small to
display. I don't rule out there being other issues but this appears to
be a significant one. Printing using intervals also looks suspicious
as the count appears to just increase rather than vary up and down.
Jin Yao, I don't know if you can take a look?

Thanks,
Ian

> > > 126.97 MiB uncore_imc/cas_count_write/
> > > 1,003,019,728 ns duration_time
> > >
> > > Erroneous duplication introduced in:
> > > commit 2440689d62e9 ("perf metricgroup: Remove duped metric group events").
> > >
> > > Fixes: ded80bda8bc9 ("perf expr: Migrate expr ids table to a hashmap").
> > > Reported-by: Jin, Yao <yao.jin@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
> > > ---
> > [SNIP]
> > > @@ -248,6 +260,16 @@ static struct evsel *find_evsel_group(struct evlist *perf_evlist,
> > > ev = metric_events[i];
> > > ev->metric_leader = ev;
> > > set_bit(ev->idx, evlist_used);
> > > + /*
> > > + * Mark two events with identical names in the same group as
> > > + * being in use as uncore events may be duplicated for each pmu.
> > > + */
> > > + evlist__for_each_entry(perf_evlist, ev) {
> > > + if (metric_events[i]->leader == ev->leader &&
> > > + !strcmp(metric_events[i]->name, ev->name)) {
> > > + set_bit(ev->idx, evlist_used);
> >
> > I'm not sure whether they are grouped together.
> > But if so, you can use for_each_group_member(ev, leader).
>
> Good suggestion, unfortunately the groups may be removed for things
> like NMI watchdog or --metric-no-group and so there wouldn't be a
> leader to follow. We could do something like this:
>
> if (metric_events[i]->leader) {
> for_each_group_member(ev, leader) {
> if (!strcmp(metric_events[i]->name, ev->name))
> set_bit(ev->idx, evlist_used);
> }
> } else {
> evlist__for_each_entry(perf_evlist, ev) {
> if (!ev->leader && !strcmp(metric_events[i]->name, ev->name))
> set_bit(ev->idx, evlist_used);
> }
> }
> }
>
> What do you think?
>
> Thanks,
> Ian
>
> > Thanks
> > Namhyung
> >
> >
> > > + }
> > > + }
> > > }
> > >
> > > return metric_events[0];
> > > --
> > > 2.28.0.526.ge36021eeef-goog
> > >