On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
<vikas.shivappa@xxxxxxxxx> wrote:
On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx>
wrote:
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
If resctrl groups could lift the restriction of one resctrl group per
CLOSID, then the user could create many resctrl groups in the way perf
cgroups are created now. The advantage is that there won't be a cgroup
hierarchy, making things much simpler. There would also be no need to
optimize the perf event context switch to make llc_occupancy work.
So if I understand you correctly, then you want a mechanism to have
groups
of entities (tasks, cpus) and associate them to a particular resource
control group.
So they share the CLOSID of the control group and each entity group can
have its own RMID.
Now you want to be able to move the entity groups around between control
groups without losing the RMID associated to the entity group.
So the whole picture would look like this:
rdt -> CTRLGRP -> CLOSID
mon -> MONGRP -> RMID
And you want to move MONGRP from one CTRLGRP to another.
Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.
Can you please write up in an abstract way what the design requirements
are that you need. So far we are talking about implementation details
and unspecified wishlists, but what we really need is an abstract
requirement.
My pleasure:
Design Proposal for Monitoring of RDT Allocation Groups.
-----------------------------------------------------------------------------
Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible although
useless. A unique CLOSID per group caps the number of CTRLGRPs at the
number of physical CLOSIDs, and CLOSIDs are much more scarce than RMIDs.
If we lift the condition of unique CLOSIDs, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID, and RDT_Allocation must maintain the schemata
to CLOSID relationship (similarly to what the previous CAT driver used
to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
now: adding an element removes it from its previous CTRLGRP.
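The schemata-to-CLOSID bookkeeping this implies could look roughly as
follows; this is a userspace Python sketch, not kernel code, and the
names closid_get/closid_put and the fixed pool size are made up for
illustration:

```python
# Sketch (not kernel code): CTRLGRPs with identical schemata share one
# CLOSID; a new schemata takes a free CLOSID or fails, as the old CAT
# driver's requests used to fail when CLOSIDs ran out.
NUM_CLOSIDS = 4  # assumed hardware limit for the sketch

closid_refcount = [0] * NUM_CLOSIDS    # CTRLGRPs sharing each CLOSID
closid_schemata = [None] * NUM_CLOSIDS

def closid_get(schemata):
    """Return a CLOSID for this schemata, sharing an existing one if possible."""
    for i in range(NUM_CLOSIDS):
        if closid_refcount[i] and closid_schemata[i] == schemata:
            closid_refcount[i] += 1
            return i                   # share the existing CLOSID
    for i in range(NUM_CLOSIDS):
        if not closid_refcount[i]:
            closid_schemata[i] = schemata
            closid_refcount[i] = 1
            return i                   # fresh CLOSID
    return -1                          # out of CLOSIDs: the request fails

def closid_put(closid):
    """Drop one CTRLGRP's reference to a CLOSID."""
    closid_refcount[closid] -= 1
```

Under this scheme the hardware limit bounds the number of distinct
schematas in use, not the number of CTRLGRPs.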
This change would allow further partitioning the allocation groups
into (allocation, monitoring) groups as follows:
With allocation only:
              CTRLGRP0     CTRLGRP_ALLOC_ONLY
  schemata:   L3:0=0xff0   L3:0=0x00f
  tasks:      PID0         P0_0,P0_1,P1_0,P1_1
  cpus:       0x3          0xC
Not clear what the PID0 and P0_0 mean?
PID0, and P*_* are arbitrary PIDs. The tasks file works the same as it
does now in RDT. I am not changing that.
If you have to support something like MONGRP and CTRLGRP, overall you
want to allow a task to be present in multiple groups?
I am not proposing to support MONGRP and CTRLGRP. I am proposing to
allow monitoring of CTRLGRPs only.
If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:
              CTRLGRP0     CTRLGRP1     CTRLGRP2     CTRLGRP3
  schemata:   L3:0=0xff0   L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
  tasks:      PID0         <none>       P0_0,P0_1    P1_0,P1_1
  cpus:       0x3          0xC          0x0          0x0
Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
(L3,0).
Now we can ask perf to monitor any of the CTRLGRPs independently, once
we solve how to tell perf which (CTRLGRP, resource_id) to monitor.
The perf_event will reserve and assign the RMID to the monitored
CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
(CLOSID and RMID), so perf won't have to.
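Both fields live in the same MSR: IA32_PQR_ASSOC carries the RMID in
bits 0-9 and the CLOSID (COS) in bits 32-63, which is why one write at
context switch can set both. A minimal sketch of the packing (the
helper name is invented; the kernel would simply wrmsr the value):

```python
# IA32_PQR_ASSOC layout: RMID in bits 0-9, CLOSID (COS) in bits 32-63.
# This only models the bit packing the RDT subsystem would write on
# context switch; it is not kernel code.
RMID_MASK = (1 << 10) - 1

def pqr_assoc_val(closid, rmid):
    """Pack a CLOSID and an RMID into an IA32_PQR_ASSOC value."""
    return (closid << 32) | (rmid & RMID_MASK)
```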
This can be solved by supporting just the -t option in perf and a new
option in perf to support resctrl group monitoring (something similar
to -R). That way we provide the flexible granularity to monitor tasks
independently of whether they are in any resctrl group (and hence also
a subset).
One of the key points of my proposal is to remove independent
monitoring of PIDs. That simplifies things by letting RDT handle
CLOSIDs and RMIDs together.
CTRLGRP     TASKS       MASK
CTRLGRP1    PID1,PID2   L3:0=0xf,1=0xf0
CTRLGRP2    PID3,PID4   L3:0=0xf0,1=0xf00
#perf stat -e llc_occupancy -R CTRLGRP1
#perf stat -e llc_occupancy -t PID3,PID4
The RMID allocation is independent of the resctrl CLOSID allocation,
and hence the RMID is not always married to a CLOSID, which seems to be
the requirement here.
It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
change in my proposal.
OR
We could have CTRLGRPs with control_only, monitor_only or
control_monitor options.
Now a task could be present in both a control_only and a monitor_only
group, or it could be present only in a control_monitor group. The
transitions from one state to another are guarded by this same
principle.
CTRLGRP     TASKS       MASK                TYPE
CTRLGRP1    PID1,PID2   L3:0=0xf,1=0xf0     control_only
CTRLGRP2    PID3,PID4   L3:0=0xf0,1=0xf00   control_only
CTRLGRP3    PID2,PID3                       monitor_only
CTRLGRP4    PID5,PID6   L3:0=0xf0,1=0xf00   control_monitor
CTRLGRP3 lets you monitor a set of tasks which are not bound to the
same CTRLGRP, and you can add or move tasks into it. Adding and
removing tasks is what is easily supported here compared to task
granularity, although such a thing could still be supported with task
granularity.
CTRLGRP4 lets you tie monitoring and control together, so when tasks
move in and out of it we still have that group to consider. These
groups still retain the cpu masks as before, so that cpu monitoring is
still supported.
Instead of having 3 types of CTRLGRPs, I am proposing one kind
(equivalent to your control_monitor type) that uses a non-zero RMID
when an appropriate perf_event is attached to it. What advantages do
you see in having 3 distinct types?
In this case we would need a new option to support ctrlgrp monitoring
in perf, or a new tool to do all this if we don't want to bother perf.
Agree, I like expanding the cgroup fd option to take CTRLGRP fds, as
described in the Implementation Ideas part of the proposal.
If CTRLGRP's schemata changes, the RDT subsystem will find a new
CLOSID for the new schemata (potentially reusing an existing one) or
fail (just like the old CAT used to). The RMID does not change during
schemata updates.
If a CTRLGRP dies, the monitoring perf_event continues to exist as a
useless wraith, just as happens with cgroup events now.
Since CTRLGRPs have no hierarchy, there is no need to handle hierarchy
in the new RDT Monitoring PMU, greatly simplifying it over the
previously proposed versions.
A breaking change in user observed behavior with respect to the
existing CQM PMU is that there wouldn't be task events. A task must be
part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
pair. If a user wants to monitor a task across multiple resources
(e.g. l3_occupancy across two packages), she must create one event per
resource_id and add the two counts.
I see this breaking change as an improvement, since hiding the cache
topology from user space introduced lots of ugliness and complexity
into the CQM PMU without improving accuracy over user space adding the
events.
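The userspace addition of per-resource_id counts could be sketched like
this; the event handles, the read callback, and the byte counts are
illustrative stand-ins, not an existing perf API:

```python
# Sketch: with one llc_occupancy event per (CTRLGRP, resource_id), a
# cross-package total is computed by the user, not the kernel. The
# read_event callback stands in for however the counts are fetched.
def read_total_occupancy(read_event, events_by_domain):
    """Sum one occupancy count per per-domain event."""
    return sum(read_event(evt) for evt in events_by_domain.values())
```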
Implementation ideas:
First idea is to expose one monitoring file per resource in a CTRLGRP,
so the list of CTRLGRP's files would be: schemata, tasks, cpus,
monitor_l3_0, monitor_l3_1, ...
The monitor_<resource_id> file descriptor is passed to perf_event_open
in the way cgroup file descriptors are passed now. All events on the
same (CTRLGRP, resource_id) share an RMID.
The RMID allocation part can be handled either by RDT Allocation or by
the RDT Monitoring PMU. Either way, the existence of the PMU's
perf_events allocates/releases the RMID.
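That event-driven RMID lifetime could look roughly as follows; again a
userspace Python sketch under assumed names and an assumed fixed RMID
pool, not kernel code:

```python
# Sketch (not kernel code): all events on the same (CTRLGRP, resource_id)
# pair share one RMID. The first attached perf_event allocates it; the
# destruction of the last one releases it.
NUM_RMIDS = 8  # assumed pool size for the sketch

rmid_busy = [False] * NUM_RMIDS

class MonState:
    """Hypothetical per-(CTRLGRP, resource_id) monitoring state."""
    def __init__(self):
        self.rmid = None
        self.nr_events = 0

def event_attach(m):
    """Attach a perf_event; allocate an RMID on the first attach."""
    if m.nr_events == 0:
        for i, busy in enumerate(rmid_busy):
            if not busy:
                rmid_busy[i] = True
                m.rmid = i
                break
        else:
            return None               # out of RMIDs
    m.nr_events += 1
    return m.rmid

def event_detach(m):
    """Detach a perf_event; release the RMID when the last one goes."""
    m.nr_events -= 1
    if m.nr_events == 0:
        rmid_busy[m.rmid] = False
        m.rmid = None
```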
Also, since this new design removes hierarchy and task events, it
allows for a simple solution to the RMID rotation problem. The removal
of task events eliminates the cgroup vs task event conflict existing
in the upstream version; it also removes the need to ensure that all
active packages have RMIDs at the same time, which added complexity to
my version of CQM/CMT. Lastly, the removal of hierarchy removes the
reliance on cgroups, the complex tree-based read, and all the hooks
and cgroup files that abused the cgroup subsystem.
Yes, not sure if the view is the same after I sent the implementation
details in the documentation :) (most likely it is).
But the option could be to not support perf_cgroup for cqm and to
support a new option in perf to monitor resctrl groups and tasks (or
some other option like mongrp).
Agree with not supporting cgroups. This proposal is about supporting
neither cgroups nor tasks, and doing all monitoring through CTRLGRPs
via an expansion of an existing perf option.
I am so far inclined to creating a new monitoring interface; that way
we don't try to abuse the existing perf specifics for this RDT or later
RDT quirks/features.
On first inspection it seems to me like perf would be fine with this
approach. It requires no changes to the system call, just some changes
in the way the cgroup_fd is handled in perf_event_open (besides making
sure that a context-less PMU doesn't break things). Do you foresee any
conflict with future features?
Thanks,
David