Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
From: Stephane Eranian
Date: Fri Jan 20 2017 - 14:31:18 EST
On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
<vikas.shivappa@xxxxxxxxxxxxxxx> wrote:
>
> Resending, now including Thomas, also with some changes. Sorry for the spam.
>
> Based on the feedback from Thomas and Peterz, I can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group:
> the user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or CPUs).
>
> -Also allow monitoring outside of resctrl so that the user can
> monitor subgroups which use the same CLOSid. This mode can be used
> when the user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, the second version
> builds a new interface, resmon. The first version is close to the patches
> already sent, with some additions/changes, and includes details of the design
> as per the Thomas/Peterz feedback.
>
> 1> First design option: without modifying resctrl, using perf
> --------------------------------------------------------------------
> --------------------------------------------------------------------
>
> In this design everything in the resctrl interface works as
> before (the info directory and the resource group files like tasks and
> schemata all remain the same).
>
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> The user can monitor CPU occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how the user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
Presented this way, this does not quite address the use case I
described earlier here.
We want to be able to monitor the cgroup's allocations from the first thread
creation. What you have above has a large gap: many apps do allocations
as their very first steps, so if you do:
$ my_test_prg &
[1456]
$ echo 1456 >/sys/fs/cgroup/perf_event/g2/tasks
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2
you have a race. But if you allow:
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2 (i.e., on an empty cgroup)
$ echo $$ >/sys/fs/cgroup/perf_event/g2/tasks (put the shell in the cgroup, so
  my_test_prg runs immediately in the cgroup)
$ my_test_prg &
then there is a way to avoid the gap.
>
> To monitor a resctrl group, the user can place the same tasks that are in the
> resctrl group into a cgroup.
>
> To monitor the tasks in p1 in Example 2 below, add the tasks in resctrl
> group p1 to cgroup g1:
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new monitoring option for resctrl may complicate monitoring,
> because supporting both cgroup 'task groups' and resctrl 'task groups' leads
> to situations where, if the groups intersect, there is no way to know which
> L3 allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for both g1 and p1 would be to allocate an RMID
> for each task, which can just as well be done with the -t option.
>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (e.g., say g11
> has an ancestor g1 and both are being monitored, as shown below) we
> need the g11 counts to be included in g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring llc_occupancy for g1 we cannot write two different RMIDs
> during context switch to measure the occupancy of both g1 and g11
> (because we need to count for g11 as well).
> Hence the driver maintains this information and, during ctx switch,
> writes the RMID of the lowest monitored member in the ancestry.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>         void *cqm_info;                 /* cqm monitoring state of this cgroup */
> #endif
> ...
>
> }
>
> struct cqm_info {
>         bool mon_enabled;               /* is this cgroup itself being monitored */
>         int level;                      /* depth in the cgroup hierarchy (root = 0) */
>         u32 *rmid;                      /* per-package RMIDs of this cgroup */
>         struct cgrp_cqm_info *mfa;      /* nearest monitored ancestor */
>         struct list_head tskmon_rlist;  /* monitored tasks inside this cgroup */
> };
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' (mfa) at all times.
> Since the root cgroup is always monitored, all descendants
> at boot time monitor for root, and hence every mfa points to root,
> except for root->mfa which is NULL.
>
> 1. RMID setup: when cgroup x starts monitoring:
> for each descendant y, if y->mfa->level < x->level, then
> y->mfa = x. (Where the level of the root node = 0...)
> 2. sched_in: during sched_in for x,
> if (x->mon_enabled) choose x->rmid,
> else choose x->mfa->rmid.
> 3. read: for each descendant y of cgroup x,
> if (y->mon_enabled) count += rmid_read(y->rmid).
> 4. evt_destroy: for each descendant y of x, if (y->mfa == x)
> then y->mfa = x->mfa. Meaning, if any descendant was monitoring for
> x, set that descendant to monitor for the cgroup which x was
> monitoring for. (A sketch of steps 1 and 4 follows below.)
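>
> As a rough illustration, here is a minimal C sketch of steps 1 and 4 above. It
> assumes a hypothetical cqm_info_of() helper and treats mfa as pointing to the
> same structure type; locking and RMID management are omitted, so this is not
> the actual patch code.
>
> #include <linux/cgroup.h>
>
> /* assumed helper: returns the cqm_info attached to a css's perf_cgroup */
> static struct cqm_info *cqm_info_of(struct cgroup_subsys_state *css);
>
> /* Step 1: cgroup x starts monitoring */
> static void cqm_start_mon(struct cgroup_subsys_state *css)
> {
>         struct cqm_info *x = cqm_info_of(css);
>         struct cgroup_subsys_state *pos;
>
>         x->mon_enabled = true;
>         /* descendants monitoring a shallower ancestor now monitor for x */
>         css_for_each_descendant_pre(pos, css) {
>                 struct cqm_info *y = cqm_info_of(pos);
>
>                 if (pos != css && y->mfa->level < x->level)
>                         y->mfa = x;
>         }
> }
>
> /* Step 4: cgroup x stops being monitored */
> static void cqm_stop_mon(struct cgroup_subsys_state *css)
> {
>         struct cqm_info *x = cqm_info_of(css);
>         struct cgroup_subsys_state *pos;
>
>         x->mon_enabled = false;
>         /* descendants that monitored for x fall back to x's own mfa */
>         css_for_each_descendant_pre(pos, css) {
>                 struct cqm_info *y = cqm_info_of(pos);
>
>                 if (y->mfa == x)
>                         y->mfa = x->mfa;
>         }
> }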
>
> To monitor a task in a cgroup x along with monitoring cgroup x itself,
> cqm_info maintains a list of the tasks that are being monitored in the
> cgroup.
>
> When a task which belongs to a cgroup x is itself being monitored, it
> always uses its own task->rmid during sched_in, even if cgroup x is monitored.
> To account for the counts of such tasks, the cgroup keeps this list
> and parses it during read (see the sketch below).
> tskmon_rlist is used to maintain the list. The list is modified when a
> task is attached to the cgroup or removed from the group.
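>
> A hedged sketch of what the read side could then look like: the entry type,
> cqm_info_of() and rmid_read() (summing an RMID set across packages) are
> assumed names for illustration, not the actual patch code.
>
> #include <linux/cgroup.h>
> #include <linux/list.h>
>
> /* assumed helpers, see the sketch above */
> static struct cqm_info *cqm_info_of(struct cgroup_subsys_state *css);
> static u64 rmid_read(u32 *rmid);
>
> /* hypothetical node linked into cqm_info->tskmon_rlist */
> struct tsk_rmid_entry {
>         u32 *rmid;              /* per-package RMIDs of the monitored task */
>         struct list_head list;
> };
>
> /*
>  * Total llc_occupancy for cgroup x: every monitored descendant plus the
>  * individually monitored tasks on each descendant's tskmon_rlist.
>  */
> static u64 cqm_cgrp_read(struct cgroup_subsys_state *css)
> {
>         struct cgroup_subsys_state *pos;
>         u64 count = 0;
>
>         css_for_each_descendant_pre(pos, css) {
>                 struct cqm_info *y = cqm_info_of(pos);
>                 struct tsk_rmid_entry *e;
>
>                 if (y->mon_enabled)
>                         count += rmid_read(y->rmid);
>                 list_for_each_entry(e, &y->tskmon_rlist, list)
>                         count += rmid_read(e->rmid);
>         }
>         return count;
> }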
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on cores 4-7 and
> a non real-time workload assigned to cores 0-3. The real-time tasks share
> text and data, so a per-task association is not required and, due to
> interaction with the kernel, it's desired that the kernel on these cores
> shares L3 with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> Cores 0-1 are assigned to the new group; make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> Monitor cpus 0-1:
>
> # perf stat -e llc_occupancy -C 0-1
>
> Example 2
> ---------
>
> A real time task running on cpus 2-3 (socket 0) is allocated a dedicated 25%
> of the cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> To monitor the same group of tasks, create a cgroup g1 and add the tasks to it:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
> # perf stat -e llc_occupancy -a -G g1
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first, before
> assigning any CLOSids. This also provides an override option where the user
> can monitor some tasks which have, say, CLOS 0, and which he is about to place
> in a CLOSid based on the amount of cache occupancy. This could apply to
> the same real time tasks above, where the user is calibrating the % of cache
> that is needed.
>
> # perf stat -e llc_occupancy -t PIDx,PIDy
>
> RMID allocation
> ---------------
>
> RMIDs are allocated per package to achieve better scaling of RMIDs.
> RMIDs are plentiful (2-4 per logical processor) and, being per package,
> a two socket system has twice the number of RMIDs.
> If we still run out of RMIDs, an error is returned saying that monitoring
> was not possible because no RMID was available.
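>
> For illustration, a rough sketch of what a per-package allocator could look
> like; the pool structure and names here are assumptions, not the patch code
> (the maximum RMID per package comes from CPUID leaf 0xF).
>
> #include <linux/bitmap.h>
> #include <linux/errno.h>
> #include <linux/spinlock.h>
>
> /* one pool per physical package; RMID 0 stays reserved for the root group */
> struct pkg_rmid_pool {
>         unsigned long   *free_map;      /* one bit per free RMID on this package */
>         u32             max_rmid;       /* highest RMID, from CPUID.0xF */
>         raw_spinlock_t  lock;
> };
>
> static int alloc_rmid(struct pkg_rmid_pool *pool)
> {
>         u32 rmid;
>
>         raw_spin_lock(&pool->lock);
>         rmid = find_first_bit(pool->free_map, pool->max_rmid + 1);
>         if (rmid > pool->max_rmid) {
>                 raw_spin_unlock(&pool->lock);
>                 return -ENOSPC;         /* out of RMIDs: monitoring not possible */
>         }
>         clear_bit(rmid, pool->free_map);
>         raw_spin_unlock(&pool->lock);
>
>         return rmid;
> }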
>
> Kernel Scheduling
> -----------------
>
> During ctx switch cqm chooses the RMID in the following priority (a sketch
> follows the list):
>
> 1. if the cpu has an RMID, choose that
> 2. if the task has an RMID directly tied to it, choose that (the task is
> monitored)
> 3. choose the RMID of the task's cgroup (by default tasks belong to the root
> cgroup, with RMID 0)
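>
> A minimal sketch of that priority order; cqm_cpu_rmid, the task rmid field and
> task_cqm_info() are assumed hooks for illustration, and the actual patch code
> will differ:
>
> #include <linux/percpu.h>
> #include <linux/sched.h>
> #include <linux/smp.h>
> #include <linux/topology.h>
>
> /* assumed: set when the CPU itself is monitored (perf -C) */
> static DEFINE_PER_CPU(u32, cqm_cpu_rmid);
>
> /* assumed helper: cqm_info of the task's perf_cgroup */
> static struct cqm_info *task_cqm_info(struct task_struct *tsk);
>
> static u32 cqm_pick_rmid(struct task_struct *tsk)
> {
>         int pkg = topology_physical_package_id(smp_processor_id());
>         struct cqm_info *ci;
>
>         /* 1. a cpu-level RMID takes precedence */
>         if (this_cpu_read(cqm_cpu_rmid))
>                 return this_cpu_read(cqm_cpu_rmid);
>
>         /* 2. a directly monitored task uses its own RMID
>          *    (assumes a per-package rmid array added to task_struct) */
>         if (tsk->rmid)
>                 return tsk->rmid[pkg];
>
>         /* 3. otherwise the task's cgroup: its own RMID if monitored, else the
>          *    RMID of its nearest monitored ancestor (root => RMID 0) */
>         ci = task_cqm_info(tsk);
>         return ci->mon_enabled ? ci->rmid[pkg] : ci->mfa->rmid[pkg];
> }
>
> The chosen RMID (together with the CLOSid picked by resctrl) is what ends up
> in the IA32_PQR_ASSOC MSR on that CPU at ctx switch.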
>
> Read
> ----
>
> When the user calls cqm to retrieve the monitored count, we read the
> counter MSR and return the count. For a cgroup hierarchy, the count is
> measured as explained in the cgroup implementation section, by traversing
> the cgroup hierarchy.
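>
> For reference, a sketch of the per-RMID counter read along the lines of the
> driver's existing __rmid_read(); the MSR layout is from the SDM and error
> handling is simplified:
>
> #include <linux/bitops.h>
> #include <asm/msr.h>
>
> #define QOS_L3_OCCUP_EVENT_ID   0x01
> #define RMID_VAL_ERROR          BIT_ULL(63)
> #define RMID_VAL_UNAVAIL        BIT_ULL(62)
>
> static u64 cqm_read_rmid(u32 rmid)
> {
>         u64 val;
>
>         /* select <rmid, llc_occupancy>, then read the counter */
>         wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
>         rdmsrl(MSR_IA32_QM_CTR, val);
>
>         if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>                 return 0;       /* no data available for this RMID */
>
>         return val;             /* occupancy, scaled by the h/w upscaling factor */
> }
>
> Since RMIDs are per package, the read for each package would be performed on a
> CPU belonging to that package and the per-package values summed.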
>
>
> 2> Second design option: build a new usermode tool, resmon
> ---------------------------------------------------------
> ---------------------------------------------------------
>
> In this design everything in the resctrl interface works as
> before (the info directory and the resource group files like tasks and
> schemata all remain the same).
>
> This version supports monitoring resctrl groups directly,
> but we need a user interface for the user to read the counters. We could
> create one file to enable monitoring and one
> file in the resctrl directory which reflects the counts, but that may not be
> efficient since the user often reads the counts frequently.
>
> Build a new user mode interface resmon
> --------------------------------------
>
> Since modifying the existing perf to
> suit the different h/w architecture does not seem to follow the CAT
> interface model, it may well be better to have a separate, dedicated
> interface for RDT monitoring (just like we added a new fs for CAT).
>
> resmon supports monitoring a resctrl group or a task. The two modes may
> provide enough granularity for monitoring:
> -can monitor cpu data.
> -can monitor per resctrl group data.
> -can choose a custom subset of tasks within a resctrl group and monitor them.
>
> # resmon [<options>]
> -r <resctrl group>
> -t <PID>
> -s <mon_mask>
> -I <time in ms>
>
> "resctrl group": is the resctrl directory.
>
> "mon_mask: is a bit mask of logical packages which indicates which packages user is
> interested in monitoring.
>
> "time in ms": The time for which the monitoring takes place
> (this can potentially be changed to start and stop/read options)
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on cores 4-7 and
> a non real-time workload assigned to cores 0-3. The real-time tasks share
> text and data, so a per-task association is not required and, due to
> interaction with the kernel, it's desired that the kernel on these cores
> shares L3 with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> Cores 0-1 are assigned to the new group; make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> Monitor cpus 0-1 for 10s:
>
> # resmon -r p0 -s 1 -I 10000
>
> Example 2
> ---------
>
> A real time task running on cpus 2-3 (socket 0) is allocated a dedicated 25%
> of the cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> Monitor the task for 5s on socket zero:
>
> # resmon -r p1 -s 1 -I 5000
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first, before
> assigning any CLOSids. This also provides an override option where the user
> can monitor some tasks which have, say, CLOS 0, and which he is about to place
> in a CLOSid based on the amount of cache occupancy. This could apply to
> the same real time tasks above, where the user is calibrating the % of cache
> that is needed.
>
> # resmon -t PIDx,PIDy -s 1 -I 10000
>
> This returns the sum of the counts of PIDx and PIDy.
>
> RMID Allocation
> ---------------
>
> This remains the same as in design version 1: we support per-package
> RMIDs and return an error when we run out of RMIDs due to the h/w limit
> on the number of RMIDs.
>
>