Re: [PATCH 4/4] selftests/resctrl: Adjust SNC support messages

From: Reinette Chatre
Date: Mon Mar 18 2024 - 15:16:22 EST




On 3/15/2024 11:02 AM, James Morse wrote:
> On 08/03/2024 18:42, Tony Luck wrote:
>> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
>>> Hi guys,
>>>
>>> On 07/03/2024 23:16, Tony Luck wrote:
>>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>>>>> Thank you for the example. I find that significantly easier to
>>>>> understand than a single number in a generic "nodes_per_l3_cache".
>>>>> Especially with potential confusion surrounding inconsistent "nodes"
>>>>> between allocation and monitoring.
>>>>>
>>>>> How about domain_cpu_list and domain_cpu_map ?
>>>
>>>> Like this (my test system doesn't have SNC, so all domains are the same):
>>>>
>>>> $ cd /sys/fs/resctrl/info/
>>>> $ grep . */domain*
>>>> L3/domain_cpu_list:0: 0-35,72-107
>>>> L3/domain_cpu_list:1: 36-71,108-143
>>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> L3_MON/domain_cpu_list:0: 0-35,72-107
>>>> L3_MON/domain_cpu_list:1: 36-71,108-143
>>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> MB/domain_cpu_list:0: 0-35,72-107
>>>> MB/domain_cpu_list:1: 36-71,108-143
>>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>
>>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
>>> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
>>
>> On an SNC system the resctrl domain for L3_MON becomes the SNC node
>> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
>>
>> Even without the SNC issue this duplication may be a useful
>> convienience. On Intel to get from a resctrl domain is a multi-step
>> process to first find which of the indexY directories has level=3
>> and then look for the "id" that matches the domain.
>>
>>> >From Tony's earlier description of how SNC changes things, the MB controls remain
>>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
>>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
>>> scoped.
>>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
>>> views of 'the' cache hierarchy.
>>
>> I almost went partly in that direction when I started this epic voyage.
>> The "almost" part was to change the names of the monitoring directories
>> under mon_data from (legacy non-SNC system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
>>
>> to (2 socket, SNC=2 system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
>
> This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM
> controls on the 'L3', but describe it as a single global L3. The MPAM driver currently
> hides this by summing the NUMA node counters and reporting it as the global L3's value.
>
>
>> While that is in some ways a more accurate view, it breaks a lot of
>> legacy monitoring applications that expect the "L3" names.
>
> True - but the behaviour is different from a non SNC system, if this software can read the
> file - but goes wrong because the contents of the file represent something different, its
> still broken.

This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
what to do about that makes me go in circles about when user space may expect resctrl to indicate
the resource and when user space may expect resctrl to indicate the scope. For example,
/sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
"L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
switches the meaning of the middle term to be "scope" while it still contains the monitoring
data of the "L3" resource. So does that mean user space would need to rely on
/sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
(/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
which resource it applies to and learn from the directory name what scope measurement is at?

Reinette