[PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

From: Vikas Shivappa
Date: Fri Jan 06 2017 - 17:27:52 EST


Resending version 5 with updated send list. Sorry for the spam.

Cqm (cache quality monitoring) is part of Intel RDT (Resource Director
Technology), which enables monitoring and control of shared processor
resources via an MSR interface.

The current upstream cqm (cache monitoring) has major issues that make
the feature almost unusable. This series tries to fix them and also
addresses Thomas's comments on previous versions of the cqm2 patch
series, to better document and organize what we are trying to fix.

Changes in V5
- Based on Peterz's feedback, removed the file interface in the
perf_event cgroup to start and stop continuous monitoring.
- Based on Andi's feedback and references, David has sent a patch
optimizing the perf overhead as a separate patch, which is generic and
not cqm specific.

This is a continuation of the patch series David (davidcc@xxxxxxxxxx)
previously posted; hence it is based on his patches and tries to fix
the same issues. Patches apply on 4.10-rc2.

Below are the issues and the fixes we attempt:

- Issue(1): Inaccurate data for per-package and systemwide monitoring.
It just prints zeros or arbitrary numbers.

Fix: Patches fix this by throwing an error if the mode is not supported.
The supported modes are task monitoring and cgroup monitoring.
Also, the per-package data for, say, socket x is returned with the
-C <cpu on socketx> -G cgrpy option.
The systemwide data can be looked up by monitoring the root cgroup.

- Issue(2): RMIDs are global and don't scale with more packages, and
hence we also run out of RMIDs very soon.

Fix: Support per-pkg RMIDs, which scale better with more packages, give
us more RMIDs to use, and are allocated only when needed (i.e. when
tasks are actually scheduled on the package).
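
As a rough illustration of the per-package, allocate-on-first-use idea
described above, here is a minimal Python model. It is only a sketch:
the class and method names are hypothetical and are not the kernel
symbols used in the patches, and reserving RMID 0 for unmonitored tasks
is an assumption of the model.

```python
# Illustrative model only; names are hypothetical, not the patch's symbols.
class PackageRmidPool:
    """Free RMIDs for one package (socket)."""

    def __init__(self, num_rmids):
        # Assumption: RMID 0 is reserved for unmonitored tasks.
        self.free = list(range(1, num_rmids))

    def alloc(self):
        # Returns None when this package has no free RMID left.
        return self.free.pop(0) if self.free else None


class MonitoredEvent:
    """One monitored task or cgroup, holding at most one RMID per package."""

    def __init__(self):
        self.rmid_by_pkg = {}

    def sched_in(self, pkg_id, pools):
        # Lazy allocation: take an RMID on a package only when the event
        # is first scheduled in there.
        rmid = self.rmid_by_pkg.get(pkg_id)
        if rmid is None:
            rmid = pools[pkg_id].alloc()
            if rmid is not None:
                self.rmid_by_pkg[pkg_id] = rmid
        return rmid


pools = {0: PackageRmidPool(4), 1: PackageRmidPool(4)}
a, b = MonitoredEvent(), MonitoredEvent()
a.sched_in(0, pools)  # a consumes an RMID on package 0 only
b.sched_in(1, pools)  # b consumes an RMID on package 1 only
```

In this model an event that only ever runs on package 0 never consumes
an RMID on package 1, which is where the extra headroom comes from.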

- Issue(3): Cgroup monitoring is not complete. No hierarchical monitoring
support, inconsistent or wrong data seen when monitoring cgroup.

Fix: Patch adds full cgroup monitoring support. Different cgroups in
the same hierarchy can be monitored together and separately. A task and
the cgroup to which the task belongs can also be monitored together.
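
One way to picture hierarchical monitoring is the toy Python model
below. It is a sketch under assumptions, not the patch's implementation:
it assumes occupancy is charged to the nearest monitored ancestor and
that a read of a cgroup sums its own count with those of its
descendants. All names are hypothetical.

```python
# Toy model of hierarchical cgroup accounting; names are hypothetical.
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.monitored = False
        self.occupancy = 0  # bytes counted against this group's own RMID
        if parent is not None:
            parent.children.append(self)


def nearest_monitored(cg):
    # Walk up until a monitored cgroup is found (or the root is passed).
    while cg is not None and not cg.monitored:
        cg = cg.parent
    return cg


def charge(cg, nbytes):
    # Assumption: an unmonitored cgroup's usage lands on its nearest
    # monitored ancestor; with no such ancestor it is not counted.
    owner = nearest_monitored(cg)
    if owner is not None:
        owner.occupancy += nbytes


def read_hierarchical(cg):
    # A cgroup's reading includes every descendant's own count.
    return cg.occupancy + sum(read_hierarchical(c) for c in cg.children)


root = Cgroup("root")
x = Cgroup("x", root)
y = Cgroup("y", x)   # monitored child of x
z = Cgroup("z", x)   # unmonitored child of x
x.monitored = True
y.monitored = True

charge(x, 100)
charge(y, 50)
charge(z, 25)        # charged to x, the nearest monitored ancestor

x_total = read_hierarchical(x)  # own 125 plus y's 50
y_total = read_hierarchical(y)  # y alone
```

Here x and y can be read together or separately, and y's usage shows up
both in its own reading and in the reading of its parent x.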

- Issue(4): A lot of inconsistent data is currently seen when we monitor
different kinds of events, like cgroup and task events, *together*.

Fix: Patch adds support to monitor a cgroup x and a task p1 within
cgroup x, and also to monitor different cgroups and tasks together.

- Issue(5): CAT and cqm/mbm write the same PQR_ASSOC_MSR separately.

Fix: Integrate the sched in code and write the PQR_MSR only once on
every switch_to.
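
The single-write idea can be sketched as follows. This is a Python model
rather than kernel code; the class name and the write counter are
hypothetical. The field layout used (RMID in bits 9:0, CLOSID in bits
63:32 of IA32_PQR_ASSOC) follows my reading of the Intel SDM, and
skipping the write when the value is unchanged is an extra assumption
beyond what is stated above.

```python
# Model of combining the RMID (cqm/mbm) and CLOSID (CAT) into one
# IA32_PQR_ASSOC update per context switch. Names are hypothetical.
RMID_MASK = 0x3FF        # RMID field: bits 9:0
CLOSID_SHIFT = 32        # CLOSID field: bits 63:32


def pqr_value(rmid, closid):
    return ((closid & 0xFFFFFFFF) << CLOSID_SHIFT) | (rmid & RMID_MASK)


class CpuPqrState:
    """Per-CPU cache of the last value written to the MSR."""

    def __init__(self):
        self.cached = pqr_value(0, 0)
        self.msr_writes = 0

    def switch_to(self, rmid, closid):
        new = pqr_value(rmid, closid)
        if new != self.cached:
            # Stands in for a single wrmsr covering both fields.
            self.cached = new
            self.msr_writes += 1


cpu = CpuPqrState()
cpu.switch_to(rmid=5, closid=2)  # one write updates both fields
cpu.switch_to(rmid=5, closid=2)  # same value: no second write
```

The point of the model is that the monitoring and allocation sides share
one cached value instead of each issuing its own MSR write.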

- Issue(6): RMID recycling leads to inaccurate data, complicates the
code and increases the code footprint. Currently, it makes the feature
almost *unusable*, as we only see zeroes and inconsistent data once we
run out of RMIDs during the lifetime of a system boot. The only way to
get the right numbers is to reboot the system once we run out of RMIDs.

Root cause: Recycling steals an RMID from an existing event x and gives
it to another event y. However, due to the nature of monitoring
llc_occupancy, we may miss tracking an unknown (possibly large) part of
the cache fills during the time when an event does not have an RMID.
Hence the user ends up with inaccurate data for both events x and y, and
the inaccuracy is arbitrary and cannot be measured. Even if event x gets
another RMID very soon after losing the previous one, we still miss all
the occupancy data that was tied to the previous RMID, which means we
cannot get accurate data even when the event has an RMID most of the
time. There is no way to guarantee accurate results with recycling, and
the data is inaccurate by an arbitrary degree. The fact that an event
can lose an RMID at any time complicates a lot of code in sched_in,
init, count and read. It also complicates mbm, as we may lose the RMID
at any time and hence need to keep a history of all the old counts.

Fix: Recycling is removed, based on Tony's original idea that it
introduces a lot of code and fails to provide accurate data, and hence
has questionable benefits. In spite of several attempts to improve the
recycling, there is no way to guarantee accurate data, as explained
above, and the incorrectness is of arbitrary degree (we can't say, for
example, that the data is off by x%). As a fix we introduce per-pkg
RMIDs to mitigate the scarcity of RMIDs to a large extent. This works
because RMIDs are plentiful: about 2 to 4 per logical processor/SMT
thread on each package. So on a 2-socket BDW system with, say, 44
logical processors/SMT threads per package, we have 176 RMIDs on each
package (a total of 2x176 = 352 RMIDs). Also, cgroup is fully supported,
and hence many threads, such as all threads in one VM/container, can be
grouped to use just one RMID. The RMIDs scale with the number of
sockets. If we still run out of RMIDs, perf read throws an error because
we cannot monitor once this limited h/w resource is exhausted.
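
The RMID count in the example works out as below. The figures are the
ones quoted above; the sketch assumes the upper end of the range (4
RMIDs per SMT thread) and reads the 44 threads as a per-package count,
since that is what yields 176.

```python
# Arithmetic for the 2-socket BDW example; inputs are assumptions
# taken from the example figures, not values queried from hardware.
rmids_per_thread = 4        # upper end of "about 2 to 4"
threads_per_package = 44    # logical processors/SMT threads per package
packages = 2

rmids_per_package = rmids_per_thread * threads_per_package
total_rmids = rmids_per_package * packages

print(rmids_per_package, total_rmids)
```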

This may be better than recycling (even with a better version than the
one upstream), where the user thinks events are being monitored but they
are actually not monitored for an arbitrary amount of time, resulting
in inaccurate data of arbitrary degree. Inaccurate data defeats the
purpose of RDT, whose goal is to provide consistent system behaviour by
giving the ability to monitor and control processor resources in an
accurate and reliable fashion. The fix instead helps provide accurate
data and to a large extent mitigates the RMID scarcity.

What's working now (unit tested):
Task monitoring, cgroup hierarchical monitoring, monitoring multiple
cgroups, monitoring a cgroup and a task in the same cgroup,
per-pkg RMIDs, error on read.

TBD:
- Most of MBM is working but will need updates to hierarchical
monitoring and other new feature related changes we introduce.

Below is a list of patches and what each patch fixes. Each commit
message also gives details on what the patch actually fixes among the
bunch:

[PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling

Before the patch: Users see only zeros or wrong data once we run out of
RMIDs.
After: Users see either correct data or an error saying that we ran out
of RMIDs.

[PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
[PATCH 04/12] x86/cqm: Add Per pkg rmid support

Before patch: RMIDs are global.
Tests: Available RMIDs increase by x times, where x is the number of
packages.
Adds LAZY RMID alloc - RMIDs are allocated during the first sched in.

[PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
[PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support
[PATCH 07/12] x86/rdt,cqm: Scheduling support update

Before patch: cgroup monitoring not supported fully.
After: cgroup monitoring is fully supported including hierarchical
monitoring.

[PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup

Before patch: cgroup and task could not be monitored together and would
result in a lot of inconsistent data.
After: Can monitor a task and a cgroup together, and also supports
monitoring a task within a cgroup and that cgroup together.

[PATCH 09/12] x86/cqm: Add RMID reuse

Before patch: Once an RMID is used, it's never used again.
After: We reuse the RMIDs which are freed. User can specify NOLAZY RMID
allocation, and open fails if we fail to get all RMIDs at open.
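
The reuse and NOLAZY behaviour can be modelled roughly as below. This is
a sketch with hypothetical names, not the patch's code: freed RMIDs go
back to a per-package free list, and a NOLAZY open is treated as
all-or-nothing across packages. Rolling back partial allocations on
failure and reserving RMID 0 are assumptions of the model.

```python
# Model of RMID reuse and NOLAZY (allocate everything at open).
# Names are hypothetical.
class RmidPool:
    def __init__(self, num_rmids):
        # Assumption: RMID 0 is reserved, so it is never handed out.
        self.free = list(range(1, num_rmids))

    def alloc(self):
        return self.free.pop() if self.free else None

    def release(self, rmid):
        # Reuse: a freed RMID becomes available again.
        self.free.append(rmid)


def open_nolazy(pools):
    """Get one RMID on every package, or fail and give back partial ones."""
    got = {}
    for pkg, pool in pools.items():
        rmid = pool.alloc()
        if rmid is None:
            for p, r in got.items():
                pools[p].release(r)
            raise OSError("no free RMID on package %d" % pkg)
        got[pkg] = rmid
    return got


def close_event(pools, got):
    for pkg, rmid in got.items():
        pools[pkg].release(rmid)


pools = {0: RmidPool(2), 1: RmidPool(2)}  # one usable RMID per package
first = open_nolazy(pools)

try:
    open_nolazy(pools)
    second_open_failed = False
except OSError:
    second_open_failed = True             # pools are exhausted

close_event(pools, first)                 # RMIDs return to the free lists
third = open_nolazy(pools)                # succeeds by reusing them
```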

[PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg
[PATCH 11/12] perf/stat: fix bug in handling events in error state
[PATCH 12/12] perf/stat: revamp read error handling, snapshot and

Patches 1/12 - 9/12 add all the features, but the data is visible
neither to perf/core nor to the perf user mode. Patches 11-12 fix this
and make the data available to the perf user mode.