Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf

From: Reinette Chatre
Date: Wed Aug 08 2018 - 01:44:52 EST


Hi Tony,

On 8/7/2018 6:28 PM, Luck, Tony wrote:
> Would it help to call routines to read the "before" values of the counter
> twice. The first time to preload the cache with anything needed to execute
> the perf code path.

>>> In an attempt to improve the accuracy of the above I modified it to the
>>> following:
>>>
>>> /* create the two events as before in "enabled" state */
>>> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc;
>>> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc;
>>> local_irq_disable();
>>> /* disable hw prefetchers */
>>> /* init local vars to loop through pseudo-locked mem
> * may take some misses in the perf code
> */
> l2_hits_before = native_read_pmc(l2_hit_pmcnum);
> l2_miss_before = native_read_pmc(l2_miss_pmcnum);
> /* Read counters again, hope no new misses here */
>>> l2_hits_before = native_read_pmc(l2_hit_pmcnum);
>>> l2_miss_before = native_read_pmc(l2_miss_pmcnum);
>>> /* loop through pseudo-locked mem */
>>> l2_hits_after = native_read_pmc(l2_hit_pmcnum);
>>> l2_miss_after = native_read_pmc(l2_miss_pmcnum);
>>> /* enable hw prefetchers */
>>> local_irq_enable();
>

The end of my previous email to Peter contains a solution that addresses
all the feedback received up to this point while still obtaining (what I
thought to be ... more below) accurate results. The code you comment on
is not this latest version, but your suggestion is valuable, so I tried it
out in two ways that differ from the code you quote in how the perf data
is read.

So, instead of reading data with native_read_pmc() as in the code you
quoted, I first tested reading the data twice using the originally
recommended perf_event_read_local(), and then tested reading the data
twice using rdpmcl() in place of native_read_pmc().

First, reading data using perf_event_read_local() called twice.
When testing as follows:
/* create perf events */
/* disable irq */
/* disable hw prefetchers */
/* init local vars */
/* read before data twice as follows: */
perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
/* read through pseudo-locked memory */
perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL);
/* re enable hw prefetchers */
/* enable irq */
/* write data to tracepoint */

With the above I am not able to obtain accurate data:
pseudo_lock_mea-351 [002] .... 61.859147: pseudo_lock_l2: hits=4109 miss=0
pseudo_lock_mea-354 [002] .... 63.045734: pseudo_lock_l2: hits=4103 miss=6
pseudo_lock_mea-357 [002] .... 64.104673: pseudo_lock_l2: hits=4106 miss=3
pseudo_lock_mea-360 [002] .... 65.174775: pseudo_lock_l2: hits=4105 miss=5
pseudo_lock_mea-367 [002] .... 66.232308: pseudo_lock_l2: hits=4104 miss=5
pseudo_lock_mea-370 [002] .... 67.291844: pseudo_lock_l2: hits=4103 miss=6
pseudo_lock_mea-373 [002] .... 68.348725: pseudo_lock_l2: hits=4105 miss=5
pseudo_lock_mea-376 [002] .... 69.409738: pseudo_lock_l2: hits=4105 miss=5
pseudo_lock_mea-379 [002] .... 70.466763: pseudo_lock_l2: hits=4105 miss=5


Second, reading data using rdpmcl() called twice.
This is the same solution as documented in my previous email, with the
two extra rdpmcl() calls added. An overview of the flow:

/* create perf events */
/* disable irq */
/* check perf event error state */
/* disable hw prefetchers */
/* init local vars */
/* read before data twice as follows: */
rdpmcl(l2_hit_pmcnum, l2_hits_before);
rdpmcl(l2_miss_pmcnum, l2_miss_before);
rdpmcl(l2_hit_pmcnum, l2_hits_before);
rdpmcl(l2_miss_pmcnum, l2_miss_before);
/* read through pseudo-locked memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);
/* re enable hw prefetchers */
/* enable irq */
/* write data to tracepoint */
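
To illustrate why the duplicated "before" reads matter, here is a toy
userspace model of the pattern. It is only a sketch under stated
assumptions: sim_counter, sim_read() and measure() are hypothetical names
and not kernel APIs, and cold_cost is an arbitrary illustrative value
standing in for events that the first, cold pass through the read path
may itself generate. The real flow uses rdpmcl() with interrupts and
hardware prefetchers disabled, as outlined above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one performance counter; not a kernel structure. */
struct sim_counter {
	uint64_t value;     /* events counted so far */
	uint64_t cold_cost; /* assumed events caused by the first, cold read */
	int warmed;         /* set once the read path has executed */
};

/*
 * Model of a counter read: sample the counter, then charge the one-time
 * cost of a cold read path so that it shows up in later samples.
 */
uint64_t sim_read(struct sim_counter *c)
{
	uint64_t sample = c->value;

	if (!c->warmed) {
		c->value += c->cold_cost;
		c->warmed = 1;
	}
	return sample;
}

/*
 * Measure "lines" accesses. With read_twice set, the first "before"
 * sample is discarded, mirroring the duplicated reads in the flow.
 */
uint64_t measure(uint64_t lines, uint64_t cold_cost, int read_twice)
{
	struct sim_counter c = { .value = 0, .cold_cost = cold_cost, .warmed = 0 };
	uint64_t before, after;

	before = sim_read(&c);          /* may be polluted by the cold path */
	if (read_twice)
		before = sim_read(&c);  /* warm path, clean baseline */

	c.value += lines;               /* stands in for reading the region */

	after = sim_read(&c);
	assert(after >= before);
	return after - before;
}
```

In this model measure(4096, 5, 0) returns 4101 while measure(4096, 5, 1)
returns 4096. These numbers are illustrative only and are not measured
data.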

With this flow, as expected, a simple test showed that the data was
accurate (hits=4096 miss=0), so I repeated the creation and measurement
of pseudo-locked regions of different sizes under different loads. Each
possible pseudo-lock region size was created and measured 100 times on an
idle system and 100 times on a system with a noisy neighbor. This
resulted in a total of 2800 pseudo-lock region creations, each followed
by a measurement of that region.

The results of these tests are the best I have yet seen. Across these
2800 measurements the number of cache hits was miscounted in only eight,
and each miscount was an under(?)count by one.
Specifically, one memory region consisting of 8192 cache lines was
measured as "hits=8191 miss=0", three memory regions with 12288 cache
lines were measured as "hits=12287 miss=0", two memory regions with
10240 cache lines were measured as "hits=10239 miss=0", and two memory
regions with 14336 cache lines were measured as "hits=14335 miss=0".
I do not think that having the number of cache hits reported as one less
than the number of read attempts is of big concern.
The miss data remained consistent and reported as zero misses - this is
the exact data we were trying to capture!
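
As a quick cross-check of the numbers above, the small C sketch below
redoes the arithmetic. The helper names are illustrative only; the
inputs are the figures quoted in this email (2800 measurements, 100 runs
per load, two loads, and the eight undercounted measurements).

```c
#include <assert.h>

/* One entry per region size that showed an undercount (from this email). */
struct miscount {
	unsigned int cache_lines; /* region size in cache lines */
	unsigned int runs;        /* measurements that read one hit short */
};

static const struct miscount miscounts[] = {
	{  8192, 1 },
	{ 12288, 3 },
	{ 10240, 2 },
	{ 14336, 2 },
};

/* Number of distinct region sizes implied by the total run count. */
unsigned int region_sizes(unsigned int total_runs, unsigned int runs_per_load,
			  unsigned int loads)
{
	assert(runs_per_load > 0 && loads > 0);
	return total_runs / (runs_per_load * loads);
}

/* Total number of measurements that undercounted hits. */
unsigned int miscounted_runs(void)
{
	unsigned int i, total = 0;

	for (i = 0; i < sizeof(miscounts) / sizeof(miscounts[0]); i++)
		total += miscounts[i].runs;
	return total;
}

/* Miscounted measurements per 10000 runs (integer arithmetic, truncated). */
unsigned int miscounts_per_10000(unsigned int total_runs)
{
	assert(total_runs > 0);
	return miscounted_runs() * 10000 / total_runs;
}
```

With these inputs, region_sizes(2800, 100, 2) gives 14 region sizes and
miscounted_runs() gives 8, which is a little under 0.3% of all
measurements.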

Thank you so much for your valuable suggestion. I do hope that we can
proceed with this way of measurement.

Reinette