Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

From: James Morse
Date: Thu Mar 09 2023 - 12:36:42 EST


Hi Peter,

On 09/03/2023 13:41, Peter Newman wrote:
> On Wed, Mar 8, 2023 at 6:45 PM James Morse <james.morse@xxxxxxx> wrote:
>> On 06/03/2023 13:14, Peter Newman wrote:
>>> On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@xxxxxxx> wrote:
>>
>>> Instead, when configuring a counter, could you use the firmware table
>>> value to compute the time when the counter will next be valid and return
>>> errors on read requests received before that?
>>
>> The monitor might get re-allocated, re-programmed and become valid for a different
>> PARTID+PMG in the meantime. I don't think these things should remain allocated over a
>> return to user-space. Without doing that, I don't see how we can return early and make
>> progress.
>>
>> How long should a CSU monitor remain allocated to a PARTID+PMG? Currently it's only for the
>> duration of the read() syscall on the file.
>>
>>
>> The problem with MPAM is that too much of it is optional. This particular behaviour is only
>> valid for CSU monitors (llc_occupancy), and then only if your hardware designers didn't
>> have a value to hand when the monitor is programmed and need to do a scan of the cache to
>> come up with a result. The retry is only triggered if the hardware sets NRDY.
>> This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
>> have its own. If there were enough, the monitors could be allocated and programmed at
>> startup, and the whole thing would become cheaper to access.
>>
>> If a hardware platform needs time to do this, it has to come from somewhere. I don't think
>> maintaining an epoch-based list of which monitor secretly belongs to a PARTID+PMG, in the
>> hope that user-space reads the file again 'quickly enough', is going to be maintainable.
>>
>> If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
>> allocates CSU monitors up-front if there are enough (today it only does this for MBWU
>> monitors). We then have to hope that folk who care about this also build hardware
>> platforms with enough monitors.
>
> Thanks, this makes more sense now. Since CSU data isn't cumulative, I
> see how synchronously collecting a snapshot is useful in this situation.
> I was more concerned about understanding the need for the new behavior
> than getting errors back quickly.
>
> However, I do want to be sure that MBWU counters will never be silently
> deallocated because we will never be able to trust the data unless we
> know that the counter has been watching the group's tasks for the
> entirety of the measurement window.

Absolutely.

The MPAM driver requires the number of monitors to match the value of
resctrl_arch_system_num_rmid_idx(); otherwise 'mbm_local' won't be offered via resctrl.
(See class_has_usable_mbwu() in [0].)

If the files exist in resctrl, then a monitor was reserved for this PARTID+PMG, and won't
get allocated for anything else.


[0]
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.2&id=f28d3fefdcf7022a49f62752acbecf180ea7d32f
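
For illustration only (the helper name below is invented; the real check is
class_has_usable_mbwu() in [0]), the requirement boils down to something like:

	/*
	 * Sketch, not the driver code: MBWU counting is only usable if
	 * there is a monitor for every PARTID+PMG index that resctrl
	 * could ask to have counted.
	 */
	static bool mbwu_monitors_sufficient(unsigned int num_mbwu_monitors)
	{
		return num_mbwu_monitors >= resctrl_arch_system_num_rmid_idx();
	}

With that satisfied, each monitor can be programmed once and left watching its
PARTID+PMG, which is what lets the MBWU values be trusted over a long
measurement window.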


> Unlike on AMD, MPAM allows software to control which PARTID+PMG the
> monitoring hardware is watching. Could we instead make the user
> explicitly request that the mbm_{total,local}_bytes events be allocated to
> monitoring groups after creating them? Or even just allocating the
> events on monitoring group creation, only when they're available, could
> also be marginally usable if a single user agent is managing rdtgroups.

Hmmmm, what would that look like to user-space?

I'm against inventing anything new here until there is feature parity, where possible, with
what's already upstream. It's a walk-then-run kind of thing.

I worry that extra steps to set up the monitoring on MPAM:resctrl will be missing or broken
in many (all?) software projects if they're not also required on Intel:resctrl.

My plan for hardware with insufficient counters is to make the counters accessible via
perf, and do that in a way that works on Intel and AMD too.


Thanks,

James