Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
From: David Carrillo-Cisneros
Date: Thu Feb 02 2017 - 20:41:16 EST
On Thu, Feb 2, 2017 at 3:41 PM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
>> There is no need to change perf(1) to support
>> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>>
>> the PMU can work with resctrl to provide the support through
>> perf_event_open, with the advantage that tools other than perf could
>> also use it.
>
> I agree it would be better to expose the counters through
> a standard perf_event_open() interface ... but we don't seem
> to have had much luck doing that so far.
>
> That would need the requirements to be re-written with the
> focus of what does resctrl need to do to support each of the
> perf(1) command line modes of operation. The fact that these
> counters work rather differently from normal h/w counters
> has resulted in massively complex volumes of code trying
> to map them into what perf_event_open() expects.
>
> The key points of weirdness seem to be:
>
> 1) We need to allocate an RMID for the duration of monitoring. While
> there are quite a lot of RMIDs, it is easy to envision scenarios
> where there are not enough.
>
> 2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process
> of interest is running.
>
> 3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events
>
> 4) For llc_occupancy the count can change even when none of the processes
> are running because cache lines are evicted
>
> 5) llc_occupancy measures the delta, not the absolute occupancy. To
> get a good result requires monitoring from process creation (or
> lots of patience, or the nuclear option "wbinvd").
>
> 6) RMID counters are package scoped
>
>
> These result in all sorts of hard to resolve situations. E.g. you are
> monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
> looking at the cache occupancy of PID=234 using RMID=45. The scheduler
> decides to run my process on your CPU. We can only load one RMID, so
> one of us will be disappointed (unless we have some crazy complex code
> where your instance of perf borrows RMID=45 and reads out the local
> byte count on sched_in() and sched_out() to add to the running count
> you were keeping against RMID=22).
>
> How can we document such restrictions for people who haven't been
> digging in this code for over a year?
>
> I think a perf_event_open() interface would make some simple cases
> work, but result in some swearing once people start running multiple
> complex monitors at the same time.
More problems:
7) Time multiplexing of RMIDs is hard because llc_occupancy cannot be reset.
8) Only one RMID per CPU can be loaded at a time into PQR_ASSOC.
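
To make (2) and (8) concrete, here is a toy model of the CPU2 scenario quoted above. It is only a sketch, not kernel code: the Cpu class, the run() helper, the "task RMID wins over CPU RMID" policy and the 100 bytes per time slice are all assumptions made up for illustration.

```python
# Toy model, not kernel code: each logical CPU has one PQR_ASSOC slot,
# so only one RMID can accumulate counts on that CPU at a time.

class Cpu:
    def __init__(self, cpu_id, default_rmid=0):
        self.cpu_id = cpu_id
        self.pqr_assoc = default_rmid  # the single RMID currently loaded

    def sched_in(self, task_rmid, cpu_rmid=None):
        # Hypothetical policy: a task's RMID wins over a CPU-wide RMID.
        self.pqr_assoc = task_rmid if task_rmid is not None else (cpu_rmid or 0)
        return self.pqr_assoc


def run(cpu, schedule, cpu_rmid, bytes_per_slice=100):
    """Charge each time slice's traffic to whichever RMID is loaded."""
    counts = {}
    for task_rmid in schedule:
        loaded = cpu.sched_in(task_rmid, cpu_rmid)
        counts[loaded] = counts.get(loaded, 0) + bytes_per_slice
    return counts


cpu2 = Cpu(2)
# Three slices on CPU2: unmonitored task, PID=234 (RMID=45), unmonitored task.
counts = run(cpu2, [None, 45, None], cpu_rmid=22)
print(counts)
```

In this model the CPU-wide monitor on RMID=22 is charged for only two of the three slices; the traffic from the slice where RMID=45 was loaded is invisible to it unless something adds it back.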
Most of the complexity in past attempts was caused by:
A. Task events being defined as system-wide and not package-wide.
What you describe in points (4) and (6) made this complicated.
B. The cgroup hierarchy, due to (7) and (8).
A and B caused the bulk of the code by complicating RMID assignment,
reading and rotation.
Now that we've learned from past experience, we have defined
per-domain monitoring and use flat groups. FWICT, that is enough to allow
a simple implementation that can be expressed through perf_event_open.
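
As a rough illustration of why per-domain monitoring with flat groups keeps RMID assignment simple, consider the toy model below. It is a sketch under my own assumptions, not the resctrl implementation: the Domain and FlatGroups classes, the per-domain RMID pool and the use of RMID 0 as the default are all hypothetical.

```python
# Toy model of flat, per-domain monitoring groups (assumed structure,
# not the actual resctrl implementation).

class Domain:
    """One package-scoped monitoring domain with its own RMID pool."""
    def __init__(self, domain_id, num_rmids):
        self.domain_id = domain_id
        self.free = list(range(1, num_rmids))  # RMID 0 kept as the default

    def alloc(self):
        if not self.free:
            raise RuntimeError(f"domain {self.domain_id}: out of RMIDs")
        return self.free.pop(0)


class FlatGroups:
    def __init__(self, domains):
        self.domains = {d.domain_id: d for d in domains}
        self.groups = {}  # group name -> {domain_id: rmid}; no nesting

    def create(self, name, domain_ids):
        # One RMID per requested domain; nothing is inherited from a parent.
        self.groups[name] = {d: self.domains[d].alloc() for d in domain_ids}
        return self.groups[name]

    def rmid_for(self, name, domain_id):
        # RMID to load into PQR_ASSOC when the group runs in this domain;
        # fall back to the default RMID 0 where the group is not monitored.
        return self.groups[name].get(domain_id, 0)


groups = FlatGroups([Domain(0, 4), Domain(1, 4)])
a = groups.create("a", [0, 1])
b = groups.create("b", [1])
print(a, b, groups.rmid_for("b", 0))
```

With no hierarchy, the lookup at sched_in time is a single table access per (group, domain) pair, and a group only consumes RMIDs in the domains where it is actually monitored.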