Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

From: Reinette Chatre
Date: Wed Nov 09 2022 - 14:12:16 EST


Hi James,

On 11/9/2022 9:59 AM, James Morse wrote:
> Hi Reinette,
>
> On 08/11/2022 21:28, Reinette Chatre wrote:
>> On 11/3/2022 10:06 AM, James Morse wrote:
>>> (I've not got to the last message in this part of the thread yet - I'm out of time this
>>> week, back Monday!)
>>>
>>> On 21/10/2022 21:09, Reinette Chatre wrote:
>>>> On 10/19/2022 6:57 AM, James Morse wrote:
>>>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <james.morse@xxxxxxx> wrote:
>>
>> ...
>>
>>>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>>>> user who never creates child MON groups. In case the number of MON
>>>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>>>> the same partitioning configuration before giving up.
>>>>>
>>>>> User-space can choose to do this.
>>>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>>>> before reporting them to user-space.
>>>
>>>> If I understand this scenario correctly, the kernel is already doing this.
>>>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>>>> the sum of the parent CTRL_MON group and all its child MON groups.
>>>
>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>> then MPAM can export the counter files in the same way RDT does.
>>>
>>> While there are systems that have enough monitors, I don't think this is going to be the
>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>
>> This sounds related to the way monitoring was done in earlier kernels. This was
>> long before I became involved with this work. Unfortunately I am not familiar with
>> all the history involved that ended in it being removed from the kernel.
>
> Yup, I'm aware there is some history to this. It's not appropriate for the llc_occupancy
> counter as that reports state, instead of events.

Perf counts events while a process is running so memory bandwidth monitoring may
also be impacted by the caveats Peter mentioned for the upcoming AMD changes:

https://lore.kernel.org/lkml/CALPaoCidd+WwGTyE3D74LhoL13ce+EvdTmOnyPrQN62j+zZ1fg@xxxxxxxxxxxxxx/
("This has the caveats that evictions while one task is running could have
resulted from a previous task on the current CPU, but will be counted
against the new task's software-RMID, ...")

...
>> The new counters will also not reflect the task's history.
>
> Indeed. I anticipate user-space is sampling this file periodically, otherwise it can't
> calculate a MB/s from the raw byte-count. I don't think losing the history is a problem.

Indeed. Cache occupancy may experience more corner cases depending on
the workloads. Your point that user space needs to know how/that counters
are impacted is important.
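
A minimal sketch of the periodic sampling described above, assuming user space
keeps the previous reading of a cumulative byte counter such as mbm_total_bytes.
The helper name, the MB = 10^6 bytes convention, and the "negative means
unusable" return value are assumptions for illustration, not part of resctrl:

```c
#include <stdint.h>

/*
 * Hypothetical user-space helper (not kernel or resctrl code): turn two
 * samples of a cumulative byte counter into a rate in MB/s, with
 * 1 MB taken as 10^6 bytes.
 *
 * Returns a negative value when the two samples cannot be compared, for
 * example when the interval is not positive or the counter went backwards
 * because the counters were reset by a group move.
 */
static double mbps_between(uint64_t prev_bytes, uint64_t cur_bytes,
			   double interval_sec)
{
	if (interval_sec <= 0.0 || cur_bytes < prev_bytes)
		return -1.0;

	return (double)(cur_bytes - prev_bytes) / interval_sec / 1e6;
}
```

A caller would treat a negative result as "drop this interval and restart from
the current sample", which is how lost history shows up in such a scheme.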

>
> The state before the change being lost could be a problem, but this is a difference with
> the way MPAM works. I think it's best to just expose this property to user-space, as I
> don't think it's feasible to work around.
>
> User-space would probably ignore the counter for a period of time after the move, as
> depending on where the regulation is happening, it may take a little while for the CLOSID
> change to take effect.

Agree.
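
One way user space might implement "ignore the counter for a while after the
move" is sketched below. The struct, the function names, and the idea of a
fixed settle time in milliseconds are all assumptions for illustration; how
long a CLOSID change takes to settle is platform dependent and not specified
here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space bookkeeping for one monitored group. */
struct group_sampler {
	uint64_t moved_at_ms;	/* timestamp of the most recent move */
	uint64_t settle_ms;	/* assumed settle time chosen by the caller */
	bool moved;		/* true while a move is still settling */
};

/* Record that the group was just moved to another CTRL_MON group. */
static void sampler_note_move(struct group_sampler *s, uint64_t now_ms)
{
	s->moved = true;
	s->moved_at_ms = now_ms;
}

/* Return true if a counter sample taken at now_ms should be trusted. */
static bool sampler_should_use(struct group_sampler *s, uint64_t now_ms)
{
	if (s->moved && now_ms - s->moved_at_ms < s->settle_ms)
		return false;

	s->moved = false;
	return true;
}
```

Samples taken while sampler_should_use() returns false would simply be
discarded rather than fed into any bandwidth calculation.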


>> Moving an arm64 monitor group may thus have a few surprises for user
>> space while sounding complex to support. Would adding all this additional
>> support be worth it if the guidance to user space is to instead create many
>> control groups in such a control-group-rich environment?
>
> I'd prefer it didn't exist at all, but if there are reasons to support it on x86, I'd like
> the MPAM support to be as similar as possible. I'm willing to accept (advertised!) noise
> in the counters, but a whole missing syscall is a harder sell.

ok.

>
>
>>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>>
>> Could you please elaborate? Do old counters not always keep counting?
>
> It's not new - but the expectation is the mv/rename support does this atomically without
> glitching/resetting the counters. Because of that new expectation, I think it needs
> exposing to user-space.
>
> Something should be indicated to user-space so it knows it can move monitor groups around,
> otherwise it's another 'try it and see'.

ok.

>
>>> To solve Peter's use-case, we also need:
>>> * to expose how many new groups can be created at each level.
>>> This is because MPAM doesn't have a property like num_rmid.
>
>> Unfortunately num_rmid is part of the user space interface. While MPAM
>> does not have "RMIDs" it seems that num_rmid can still be relevant
>> based on what it is described to represent in Documentation/x86/resctrl.rst:
>> "This is the upper bound for how many "CTRL_MON" + "MON" groups can
>> be created."
>
> I agree it can't be removed, and MPAM systems will need to put a value there.
> The problem is 'rmid' has a well known definition, even if the kernel documentation is
> nuanced.
>
> This might be contentious, but ideally I'd 'deprecate' num_rmid, and split it into two
> properties that don't reference an architecture. (Obviously the files have to stay for at
> least the next 10 years!)

I think this may be difficult considering the various user space clients
already in use, but doing so is reasonable.

Reinette