Re: [PATCH v19 06/20] x86/resctrl: Introduce snc_nodes_per_l3_cache

From: Tony Luck
Date: Fri May 31 2024 - 14:17:33 EST


On Thu, May 30, 2024 at 01:20:39PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/28/24 3:19 PM, Tony Luck wrote:
> > Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> > and memory controllers on a socket into two or more groups. These are
> > presented to the operating system as NUMA nodes.
> >
> > This may enable some workloads to have slightly lower latency to memory
> > as the memory controller(s) in an SNC node are electrically closer to the
> > CPU cores on that SNC node. This cost may be offset by lower bandwidth
> > since the memory accesses for each core can only be interleaved between
> > the memory controllers on the same SNC node.
> >
> > Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> > to track L3 cache occupancy and memory bandwidth. There is an MSR that
> > controls how the RMIDs are shared between SNC nodes.
> >
> > The default mode divides them numerically. E.g. when there are two SNC
> > nodes on a socket the lower number half of the RMIDs are given to the
> > first node, the remainder to the second node. This would be difficult
> > to use with the Linux resctrl interface as specific RMID values assigned
> > to resctrl groups are not visible to users.
> >
> > The other mode divides the RMIDs and renumbers the ones on the second
> > SNC node to start from zero.
> >
> > Even with this renumbering SNC mode requires several changes in resctrl
> > behavior for correct operation.
> >
> > Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> > how many SNC domains share an L3 cache instance. Initialize this to
> > "1". Runtime detection of SNC mode will adjust this value.
> >
> > Update all places to take appropriate action when SNC mode is enabled:
> > 1) The number of logical RMIDs per L3 cache available for use is the
> > number of physical RMIDs divided by the number of SNC nodes.
> > 2) Likewise the "mon_scale" value must be divided by the number of SNC
> > nodes.
> > 3) Add a function to convert from logical RMID values (assigned to
> > tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
> > to physical RMID values to load into IA32_QM_EVTSEL MSR when
> > reading counters on each SNC node.
> >
> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> > ---
> > arch/x86/kernel/cpu/resctrl/monitor.c | 37 ++++++++++++++++++++++++---
> > 1 file changed, 33 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 89d7e6fcbaa1..b9b4d2b5ca82 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
> > #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
> > +static int snc_nodes_per_l3_cache = 1;
> > +
> > /*
> > * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
> > * If rmid > rmid threshold, MBM total and local values should be multiplied
> > @@ -185,10 +187,37 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> > return entry;
> > }
> > -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > +/*
> > + * When Sub-NUMA Cluster (SNC) mode is not enabled, the physical RMID
> > + * is the same as the logical RMID.
> > + *
> > + * When SNC mode is enabled the physical RMIDs are distributed across
> > + * the SNC nodes. E.g. with two SNC nodes per L3 cache and 200 physical
> > + * RMIDs are divided with 0..99 on the first node and 100..199 on
> > + * the second node. Compute the value of the physical RMID to pass to
> > + * resctrl_arch_rmid_read().
>
> Please stop rushing version after version. I do not think you read the
> above after you wrote it. The sentences run into each other.

Re-written. Would you like to try reviewing these patches one at a time
as I fix them? That will:

a) Slow me down.
b) Keep me from building subsequent patches on top of earlier mistakes.
c) Give you bite-sized chunks to review in each sitting (I think
the overall direction of the series is well enough understood
at this point).

I've attached the updated version of patch 6 at the end of this e-mail.

> Could this be specific about what is meant by "physical" and "logical" RMID?
> To me "physical RMID" implies the RMID used by hardware and "logical RMID"
> is the RMID used by software ... but when it comes to SNC it is actually:
> "physical RMID" - RMID used by MSR_IA32_QM_EVTSEL
> "logical RMID" - RMID used by software and the MSR_IA32_PQR_ASSOC register
>
> > + *
> > + * Caller is responsible to make sure execution running on a CPU in
>
> "is responsible" and "make sure" means the same, no?
>
> "make sure execution running"?

Also re-written.

> (Looking ahead in this series and coming back to this, this looks like
> rushed work that you in turn expect folks spend quality time reviewing.)
>
> > + * the domain to be read.
> > + */
> > +static int logical_rmid_to_physical_rmid(int lrmid)
> > +{
> > + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > + int cpu = smp_processor_id();
> > +
> > + if (snc_nodes_per_l3_cache == 1)
> > + return lrmid;
> > +
> > + return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > +}
> > +
> > +static int __rmid_read(u32 lrmid,
> > + enum resctrl_event_id eventid, u64 *val)
>
> This line does not need to be split.

Joined now. I also pulled in your suggestion from a later patch to
rename this to __rmid_read_phys() and do the logical-to-physical RMID
translation at the two call sites.

> > {
> > u64 msr_val;
> > + int prmid;
> > + prmid = logical_rmid_to_physical_rmid(lrmid);
> > /*
> > * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> > * with a valid event code for supported resource type and the bits
> > @@ -197,7 +226,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> > * are error bits.
> > */
> > - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> > + wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
> > rdmsrl(MSR_IA32_QM_CTR, msr_val);
> > if (msr_val & RMID_VAL_ERROR)
> > @@ -1022,8 +1051,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> > int ret;
> > resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> > - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> > - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> > + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> > + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> > hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
> > if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
>
> Reinette
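
For reference, the divisions in rdt_get_mon_l3_config() work out like
this on a hypothetical two-SNC-node system (the input numbers are just
an example):

    boot_cpu_data.x86_cache_max_rmid + 1 = 200, snc_nodes_per_l3_cache = 2
        r->num_rmid       = 200 / 2 = 100 logical RMIDs per SNC node
    boot_cpu_data.x86_cache_occ_scale = 65536
        hw_res->mon_scale = 65536 / 2 = 32768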

-Tony

Proposed update to patch 6 with the fixes applied. I didn't include a
URL to the RDT architecture spec I reference in the comment for
logical_rmid_to_physical_rmid() because Intel URLs are notoriously
unstable. But I did check that a web search finds the document by its
title. With Google it was the second hit for me; Bing lists it first.