Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf

From: Reinette Chatre
Date: Fri Aug 03 2018 - 14:39:13 EST


Hi Peter,

On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>> You state that you understand what we are trying to do and I hope that I
>> convinced you that we are not able to accomplish the same by following
>> your guidance.
>
> No, I said I understood your pmc reserve patch and its implications.
>
> I have no clue what you're trying to do with resctl, nor why you think
> this is not feasible with perf. And if it really is not feasible, you'll
> have to live without it.

I can surely provide the details on what we are doing with resctrl and
elaborate more on why this is not feasible with the full perf kernel API.

In summary:
Building on top of Cache Allocation Technology (CAT) we load a portion
of memory into a specified region of cache. Once the cache region holds
its data, it is configured (via CAT) to only serve cache hits, so the
pre-loaded memory cannot be evicted from the cache. We call this
"cache pseudo-locking" - the memory has been "pseudo-locked" to the
cache.

To measure how successful the pseudo-locking of the memory is we can use
the precision capable performance events on our platforms: start the
cache hit and miss counters, read the pseudo-locked memory, stop the
counters. This measurement is done with interrupts and hardware
prefetchers disabled to ensure that _only_ accesses to the pseudo-locked
memory are measured.

Any additional code or data accessed either by the counter management or
even by the loops reading the memory itself can contribute to the
measured cache hits/misses, so that the counts reflect that code and
data instead of the memory we are trying to access.

Even within the current measurement code we had to take a lot of care to
not use, for example, pointers to obtain information about the memory to
be measured. The information had to be held in local variables.
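
To illustrate the sequence described above, here is a rough user-space
sketch in C. It is not the resctrl measurement code: the counters are
simulated (every access is simply counted as a hit), and struct region,
struct counters, counters_start(), counters_stop(), touch_line() and
measure_region() are names made up for this example. In the kernel the
same sequence would run with interrupts and hardware prefetchers
disabled and would program the real counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the memory to be measured. */
struct region {
	const unsigned char *mem;
	size_t size;
	size_t line_size;
};

/* Hypothetical stand-in for the hardware hit/miss counters. */
struct counters {
	uint64_t hits;
	uint64_t misses;
};

static struct counters hw;	/* simulated counters */
static int counting;		/* nonzero while the counters run */

static void counters_start(void)
{
	hw.hits = 0;
	hw.misses = 0;
	counting = 1;
}

static void counters_stop(void)
{
	counting = 0;
}

/* Simulated load: each access made while counting is recorded as a hit. */
static void touch_line(const volatile unsigned char *p)
{
	(void)*p;
	if (counting)
		hw.hits++;
}

static struct counters measure_region(const struct region *r)
{
	/*
	 * Copy everything the loop needs into locals _before_ the
	 * counters start, so that no pointer is dereferenced to look
	 * up region information while counting.
	 */
	const unsigned char *mem = r->mem;
	size_t size = r->size;
	size_t line_size = r->line_size;
	size_t i;

	/* Kernel code would disable interrupts and prefetchers here. */
	counters_start();
	for (i = 0; i < size; i += line_size)
		touch_line(mem + i);
	counters_stop();
	/* ... and restore them here. */

	return hw;
}
```

With a 4096 byte buffer and 64 byte lines this makes 64 simulated
accesses. The only point of the sketch is the ordering: the region
information is read into locals first, and nothing but the loop over
the memory sits between starting and stopping the counters.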

Consider what would happen if we were to build on top of the kernel perf
event API (perf_event_create_kernel_counter(), perf_event_enable(),
perf_event_disable(), ...). Just looking at perf_event_enable() -
ideally this would be as lean as possible, only enabling the event and
not itself contributing to the measurement. First, the enabling of the
event is not lean enough to fulfill this requirement since it executes
more code after the event was actually enabled. Also, the code relies on
a mutex so we cannot use it with interrupts disabled.

We have two types of customers of this feature: those who require very
low latency and those who require high determinism. In either case a
measured cache miss is of concern to them and our goal is to provide a
memory region for which the number of cache misses can be demonstrated.

Reinette