Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
From: Vikas Shivappa
Date: Thu Jan 19 2017 - 21:32:43 EST
Resending to include Thomas, also with some changes. Sorry for the spam.
Based on feedback from Thomas and Peterz, I can think of two design
variants with the following goals:
-Support monitoring and allocating using the same resctrl group.
The user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpus).
-Also allow monitoring outside of resctrl so that the user can
monitor subgroups which use the same closid. This mode can be used
when the user wants to monitor more than just the resctrl groups.
The first design version uses and modifies perf_cgroup; the second version
builds a new interface, resmon. The first version is close to the patches
already sent, with some additions/changes. The design details below
follow the Thomas/Peterz feedback.
1> First Design option: without modifying the resctrl and using perf
--------------------------------------------------------------------
--------------------------------------------------------------------
In this design everything in the resctrl interface works as
before (the info directory and the resource group files such as tasks
and schemata all remain the same).
Monitor cqm using perf
----------------------
perf can monitor individual tasks using the -t
option just like before.
# perf stat -e llc_occupancy -t PID1,PID2
The user can monitor cpu occupancy using the -C option in perf:
# perf stat -e llc_occupancy -C 5
The following shows how the user can monitor cgroup occupancy:
# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
# perf stat -e intel_cqm/llc_occupancy/ -a -G g2
To monitor a resctrl group, the user can place the tasks of that resctrl
group into a cgroup.
For example, to monitor the tasks in p1 from Example 2 below, add the
tasks in resctrl group p1 to cgroup g1:
# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
Introducing a new monitoring option in resctrl may complicate monitoring,
because supporting both cgroup 'task groups' and resctrl 'task groups'
leads to situations where the groups intersect, and then there is no way
to know which l3_allocations contribute to which group.
ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4
The only way to get the occupancy for both g1 and p1 would be to allocate
an RMID for each task, which can just as well be done with the -t option.
Monitoring cqm cgroups Implementation
-------------------------------------
When monitoring two different cgroups in the same hierarchy (e.g. g11
and its ancestor g1 are both being monitored, as shown below), we
need the g11 counts to be included in the g1 counts as well.
# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11
When measuring llc_occupancy for g1 we cannot write two different RMIDs
during context switch to measure the occupancy for both g1 and g11
(the g1 count has to include g11 as well).
Hence the driver maintains this information and, during ctx switch,
writes the RMID of the lowest member in the ancestry which is being
monitored.
A cqm_info pointer is added to the perf_cgroup structure to maintain this
information. The structure is allocated at css_alloc and destroyed at
css_free. All the events tied to a cgroup can use the same
information while reading the counts.
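struct perf_cgroup {
#ifdef CONFIG_INTEL_RDT_M
	void *cqm_info;		/* points to the struct cqm_info below */
#endif
...
};
struct cqm_info {
	bool mon_enabled;	/* this cgroup is itself being monitored */
	int level;		/* depth in the hierarchy, root = 0 */
	u32 *rmid;		/* RMIDs of this cgroup (RMIDs are per package) */
	struct cqm_info *mfa;	/* nearest monitored ancestor, NULL for root */
	struct list_head tskmon_rlist;	/* monitored tasks in this cgroup */
};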
Due to the hierarchical nature of cgroups, every cgroup monitors
for its 'nearest monitored ancestor' at all times.
Since the root cgroup is always monitored, all descendants
monitor for root at boot time, and hence every mfa points to root
except for root->mfa, which is NULL.
1. RMID setup: When cgroup x starts monitoring:
for each descendant y, if y's mfa->level < x->level, then
y->mfa = x. (Where level of root node = 0...)
2. sched_in: During sched_in for x
if (x->mon_enabled) choose x->rmid
else choose x->mfa->rmid.
3. read: for each descendant y of cgroup x (and for x itself)
if (y->mon_enabled) count += rmid_read(y->rmid).
4. evt_destroy: for each descendant y of x, if (y->mfa == x)
then y->mfa = x->mfa. This means that if any descendant was monitoring
for x, that descendant is set to monitor for the cgroup which x was
monitoring for.
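The four steps above can be sketched in user space roughly as follows.
This is only an illustration under stated assumptions, not the driver
code: struct cg, the flat array of cgroups, the occupancy field (which
stands in for rmid_read()) and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a cgroup and its cqm_info. */
struct cg {
	struct cg *parent;	/* NULL for root */
	struct cg *mfa;		/* nearest monitored ancestor, NULL for root */
	int level;		/* root = 0 */
	bool mon_enabled;
	uint32_t rmid;
	uint64_t occupancy;	/* what rmid_read(rmid) would return */
};

/* True if y is a strict descendant of x. */
static bool is_descendant(const struct cg *y, const struct cg *x)
{
	for (const struct cg *p = y->parent; p; p = p->parent)
		if (p == x)
			return true;
	return false;
}

/* 1. RMID setup: descendants whose mfa is above x now monitor for x. */
static void start_monitoring(struct cg *all, size_t n, struct cg *x,
			     uint32_t rmid)
{
	x->mon_enabled = true;
	x->rmid = rmid;
	for (size_t i = 0; i < n; i++) {
		struct cg *y = &all[i];

		if (is_descendant(y, x) && y->mfa && y->mfa->level < x->level)
			y->mfa = x;
	}
}

/* 2. sched_in: own RMID if monitored, else the RMID of the mfa. */
static uint32_t sched_in_rmid(const struct cg *x)
{
	if (x->mon_enabled)
		return x->rmid;
	return x->mfa ? x->mfa->rmid : 0;
}

/* 3. read: sum x and every monitored descendant of x. */
static uint64_t read_count(const struct cg *all, size_t n, const struct cg *x)
{
	uint64_t count = 0;

	for (size_t i = 0; i < n; i++) {
		const struct cg *y = &all[i];

		if ((y == x || is_descendant(y, x)) && y->mon_enabled)
			count += y->occupancy;
	}
	return count;
}

/* 4. evt_destroy: whoever monitored for x now monitors for x's mfa. */
static void stop_monitoring(struct cg *all, size_t n, struct cg *x)
{
	for (size_t i = 0; i < n; i++) {
		struct cg *y = &all[i];

		if (is_descendant(y, x) && y->mfa == x)
			y->mfa = x->mfa;
	}
	x->mon_enabled = false;
}
```

With root, g1 and g11 set up so that every mfa initially points to root,
starting monitoring on g1 moves g11->mfa to g1, and destroying the g1
event moves it back to root.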
To monitor a task in cgroup x along with monitoring cgroup x itself,
cqm_info maintains a list of the tasks that are being monitored in the
cgroup.
When a task which belongs to cgroup x is being monitored, it always
uses its own task->rmid during sched_in, even if cgroup x is monitored.
To account for the counts of such tasks, the cgroup keeps this list
and walks it during read.
tskmon_rlist is used to maintain the list. The list is modified when a
task is attached to the cgroup or removed from the cgroup.
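A minimal sketch of how the cgroup read could fold in the separately
monitored tasks. The types, the singly linked list, the fake_counter
array and this rmid_read() are hypothetical stand-ins chosen for the
illustration; the real code would use list_head and the h/w counter.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RMID 8			/* assumption, for illustration only */
static uint64_t fake_counter[MAX_RMID];	/* stand-in for the h/w counters */

/* Hypothetical stand-in for reading the occupancy of one RMID. */
static uint64_t rmid_read(uint32_t rmid)
{
	return rmid < MAX_RMID ? fake_counter[rmid] : 0;
}

/* One monitored task of the cgroup, with its own RMID. */
struct mon_task {
	uint32_t rmid;
	struct mon_task *next;
};

/* The cgroup RMID plus the list of monitored tasks in the cgroup. */
struct cgrp_mon {
	uint32_t rmid;
	struct mon_task *tskmon_list;
};

/* Cgroup count = count of the cgroup RMID + counts of the task RMIDs. */
static uint64_t cgrp_read(const struct cgrp_mon *c)
{
	uint64_t count = rmid_read(c->rmid);

	for (const struct mon_task *t = c->tskmon_list; t; t = t->next)
		count += rmid_read(t->rmid);
	return count;
}
```

The point of the list is that occupancy charged to task->rmid is
otherwise invisible to the cgroup RMID, so the read has to add it back.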
Example 1 (Some examples modeled from resctrl ui documentation)
---------
A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per-task association is not required, and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.
# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata
Cores 0-1 are assigned to the new group, which makes sure that the
kernel and the tasks running there get 50% of the cache.
# echo 03 > p0/cpus
Monitor cpus 0-1:
# perf stat -e llc_occupancy -C 0-1
Example 2
---------
A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated
25% of the cache.
# cd /sys/fs/resctrl
# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678
To monitor the same group of tasks, create a cgroup g1 and add the task
to it:
# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
# perf stat -e llc_occupancy -a -G g1
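The cache percentages quoted in the examples follow from the schemata bit
masks. A small sketch of that arithmetic, assuming each mask bit covers an
equal share of the cache, that the full mask in Example 2 is 16 bits wide
(as the 1=ffff entry suggests) and that it is 20 bits wide in Example 1
(an assumption implied by the 50% figure). The helper names are
hypothetical.

```c
#include <stdint.h>

/* Count the set bits of a capacity bit mask. */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Share of the cache (in %) covered by cbm, given the full mask. */
static unsigned int cbm_percent(uint32_t cbm, uint32_t full_mask)
{
	unsigned int total = popcount32(full_mask);

	return total ? popcount32(cbm) * 100 / total : 0;
}
```

For instance cbm_percent(0x0f00, 0xffff) gives 25 and
cbm_percent(0x3ff, 0xfffff) gives 50.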
Example 3
---------
Sometimes the user may just want to profile the cache occupancy first,
before assigning any CLOSids. This also provides an override option where
the user can monitor some tasks which have, say, CLOS 0 and which are
about to be placed in a CLOSid based on the amount of cache occupancy.
This could apply to the same real-time tasks above, where the user is
calibrating the % of cache that's needed.
# perf stat -e llc_occupancy -t PIDx,PIDy
RMID allocation
---------------
RMIDs are allocated per package to achieve better scaling of RMIDs.
RMIDs are plentiful (2-4 per logical processor) and are also per package,
meaning a two socket system would have twice the number of RMIDs.
If we still run out of RMIDs, an error is returned indicating that
monitoring wasn't possible because no RMID was available.
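A minimal sketch of a per-package RMID pool, to show why a second package
doubles the number of usable RMIDs. The pool size, the bitmap layout, the
-ENOSPC return value and the function names are assumptions made for the
illustration; only the reservation of RMID 0 for the root cgroup comes
from the description in this mail.

```c
#include <errno.h>
#include <stdint.h>

#define RMIDS_PER_PKG 4		/* assumed small pool, for illustration */

/* One pool per package; bit r set means RMID r is in use there. */
struct pkg_rmids {
	uint32_t in_use;
};

/* Allocate the lowest free RMID of one package; RMID 0 is kept for root. */
static int rmid_alloc(struct pkg_rmids *pkg)
{
	for (int r = 1; r < RMIDS_PER_PKG; r++) {
		if (!(pkg->in_use & (1u << r))) {
			pkg->in_use |= 1u << r;
			return r;
		}
	}
	return -ENOSPC;		/* assumed error code when out of RMIDs */
}

/* Return an RMID to its package pool. */
static void rmid_free(struct pkg_rmids *pkg, int r)
{
	pkg->in_use &= ~(1u << r);
}
```

Exhausting the pool of one package leaves the pool of the other package
untouched, which is the scaling benefit of per-package allocation.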
Kernel Scheduling
-----------------
During ctx switch cqm chooses the RMID in the following priority:
1. if the cpu has an RMID, choose that
2. if the task has an RMID directly tied to it, choose that (task is
monitored)
3. choose the RMID of the task's cgroup (by default tasks belong to the
root cgroup with RMID 0)
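The priority order above can be written as a small selection function.
This is a sketch only; struct sched_state and pick_rmid() are
hypothetical names, and the explicit has_rmid flags are an assumption
about how "has an RMID" would be represented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what is known at context switch time. */
struct sched_state {
	bool cpu_has_rmid;
	uint32_t cpu_rmid;
	bool task_has_rmid;
	uint32_t task_rmid;
	uint32_t cgroup_rmid;	/* 0 when the task sits in the root cgroup */
};

/* Pick the RMID to write: cpu first, then task, then the task's cgroup. */
static uint32_t pick_rmid(const struct sched_state *s)
{
	if (s->cpu_has_rmid)
		return s->cpu_rmid;
	if (s->task_has_rmid)
		return s->task_rmid;
	return s->cgroup_rmid;
}
```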
Read
----
When the user calls cqm to retrieve the monitored count, we read the
counter_msr and return the count. For a cgroup hierarchy, the count is
measured as explained in the cgroup implementation section, by traversing
the cgroup hierarchy.
2> Second Design option: Build a new usermode tool resmon
---------------------------------------------------------
---------------------------------------------------------
In this design everything in the resctrl interface works as
before (the info directory and the resource group files such as tasks
and schemata all remain the same).
This version supports monitoring resctrl groups directly.
But we need a user interface for the user to read the counters. We could
create one file in the resctrl directory to enable monitoring and one
file which reflects the counts, but this may not be efficient since the
user often reads the counts frequently.
Build a new user mode interface resmon
--------------------------------------
Since modifying the existing perf to
suit the different h/w architecture does not seem to follow the CAT
interface model, it may well be better to have a different and dedicated
interface for RDT monitoring (just like we had a new fs for CAT).
resmon supports monitoring a resctrl group or a task. The two modes may
provide enough of the granularity needed for monitoring:
-can monitor cpu data.
-can monitor per resctrl group data.
-can choose a custom set or subset of tasks within a resctrl group and
monitor them.
# resmon [<options>]
-r <resctrl group>
-t <PID>
-s <mon_mask>
-I <time in ms>
"resctrl group": is the resctrl directory.
"mon_mask": is a bit mask of logical packages which indicates which
packages the user is interested in monitoring.
"time in ms": The time for which the monitoring takes place
(this can potentially be changed to start and stop/read options)
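A short sketch of how a tool could decode mon_mask, assuming bit n of the
mask selects logical package n (consistent with "-s 1" meaning socket
zero in the examples). The function name and the output array are
hypothetical; resmon itself is only a proposal here.

```c
#include <stddef.h>

/*
 * Write the package numbers selected by mon_mask into out[] and return
 * how many were selected. nr_pkgs must not exceed the number of bits in
 * an unsigned long, and out[] must have room for nr_pkgs entries.
 */
static size_t selected_pkgs(unsigned long mon_mask, unsigned int nr_pkgs,
			    unsigned int *out)
{
	size_t n = 0;

	for (unsigned int pkg = 0; pkg < nr_pkgs; pkg++)
		if (mon_mask & (1UL << pkg))
			out[n++] = pkg;
	return n;
}
```

On a two socket system a mask of 1 selects package 0 only, 2 selects
package 1 only, and 3 selects both.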
Example 1 (Some examples modeled from resctrl ui documentation)
---------
A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per-task association is not required, and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.
# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata
Cores 0-1 are assigned to the new group, which makes sure that the
kernel and the tasks running there get 50% of the cache.
# echo 03 > p0/cpus
Monitor cpus 0-1 for 10s:
# resmon -r p0 -s 1 -I 10000
Example 2
---------
A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated
25% of the cache.
# cd /sys/fs/resctrl
# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678
Monitor the task for 5s on socket zero:
# resmon -r p1 -s 1 -I 5000
Example 3
---------
Sometimes the user may just want to profile the cache occupancy first,
before assigning any CLOSids. This also provides an override option where
the user can monitor some tasks which have, say, CLOS 0 and which are
about to be placed in a CLOSid based on the amount of cache occupancy.
This could apply to the same real-time tasks above, where the user is
calibrating the % of cache that's needed.
# resmon -t PIDx,PIDy -s 1 -I 10000
This returns the sum of the counts of PIDx and PIDy.
RMID Allocation
---------------
This would remain the same as in design version 1, where we support
per-package RMIDs and return an error when we run out of RMIDs, since the
h/w has a limited number of RMIDs.