Re: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

From: Tony Luck
Date: Wed May 22 2024 - 19:47:36 EST


On Wed, May 22, 2024 at 02:25:23PM -0700, Reinette Chatre wrote:
> > + /*
> > + * SNC: OK to read events on any CPU sharing same L3
> > + * cache instance.
> > + */
> > + if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
> > + r->mon_display_scope))
>
> By hardcoding that mon_display_scope is a cache instead of using get_domain_id_from_scope()
> it seems that all pretending about being generic has just been abandoned at this point.

Yes. It now seems like a futile quest to make this look
like something generic. All this code is operating on the
rdt_resources_all[RDT_RESOURCE_L3] resource (which by its very name is
"L3" scoped). In the SNC case the L3 has been divided (in some senses,
but not all) into nodes.

Given that pretending isn't working ... just be explicit?

Some "thinking aloud" follows ...

struct rdt_resource:
In order to track monitor events, resctrl must build a domain list based
on the smallest measurement scope. So with SNC enabled, that is the
node. With it disabled it is L3 cache scope (which on existing systems
is the same as node scope).

Maybe keep .mon_scope with the existing name, but define it to be the
minimum measurement scope and use it to build domains. So it
defaults to RESCTRL_L3_CACHE but SNC detection will rewrite it to
RESCTRL_L3_NODE.

Drop the .mon_display_scope field. By definition it must always have
the value RESCTRL_L3_CACHE. So replace checks that compare values
rdt_resources_all[RDT_RESOURCE_L3] of .mon_scope & .mon_display_scope
with:

if (r->mon_scope != RESCTRL_L3_CACHE)
// SNC stuff
else
// regular stuff

struct rdt_mon_domain:
In the rdt_mon_domain rename the display_id field with the more
honest name "l3_cache_id". In addition save a pointer to the
.shared_cpu_map of the L3 cache. When SNC is off, this will be the
same as the d->hdr.cpu_mask for the domain. For SNC on it will be
a superset (encompassing all the bits from cpu_masks in all domains
that share an L3 instance).

Where SNC specifc code is required, the check becomes:

if (d->hdr.id != d->l3_cache_id)
// SNC stuff
else
// regular stuff

The l3_cache_id can be used in mkdir code to make the mon_L3_XX
directories. The L3 .shared_cpu_map in picking a CPU to read
the counters for the "sum" files. l3_cache_id also indicates
which domains should be summed.


Does this look like a useful direction to pursue?

-Tony