RE: About adding an A64FX cache control function into resctrl

From: tan.shaopeng@xxxxxxxxxxx
Date: Wed Jul 21 2021 - 04:18:19 EST


Hi Reinette,

> On 7/7/2021 4:26 AM, tan.shaopeng@xxxxxxxxxxx wrote:
> >>> Sorry, I have not explained A64FX's sector cache function well yet.
> >>> I think I need to explain this function from a different perspective.
> >>
> >> You have explained the A64FX's sector cache function well. I have
> >> also read both specs to understand it better. It appears to me that
> >> you are not considering the resctrl architecture as part of your
> >> solution but instead just forcing your architecture onto the resctrl
> >> filesystem. For example, in resctrl the resource groups are not just
> >> a directory structure but have significance in what is being
> >> represented within the directory (a class of service). The files
> >> within a resource group's directory build on that. From your side I
> >> have not seen any effort in aligning the sector cache function with
> >> the resctrl architecture but instead you are just changing the
> >> resctrl interface to match the A64FX architecture.
> >>
> >> Could you please take a moment to understand what resctrl is and how
> >> it could be mapped to A64FX in a coherent way?
> >
> > Previously, my idea was based on how to make instructions use
> > different sectors within one task. After I studied resctrl, to
> > utilize the resctrl architecture on A64FX, I think it’s better to
> > assign one sector to one task. Thanks for your idea that "sectors"
> > could be considered the same as the resctrl "classes of service".
> >
> > Based on your idea, I am considering the implementation details.
> > In this email, I will explain the outline of my new proposal, and
> > then please allow me to confirm a few technical points about resctrl.
> >
> > The outline of my proposal is as follows.
> > - Add a sector function equivalent to Intel's CAT function into resctrl
> >   (divide the shared L2 cache into multiple partitions for multiple
> >   cores to use).
> > - Allocate one sector to one resource group (one CLOSID). Since one
> >   core can only be assigned to one resource group, on A64FX each core
> >   only uses one sector at a time.
>
> OK, so a sector is a portion of cache and matches what can be represented
> by a resource group.
>
> The second part of your comment is not clear to me. In the first part you
> mention: "one core can only be assigned to one resource group" - this seems to
> indicate some static assignment between cores and sectors and if this is the

Sorry, does "static assignment between cores and sectors" mean that
each core always uses a fixed sector ID? For example, core 0 always
uses sector 0 in all cases. That is not the case.

> case this needs more thinking since the current implementation assumes that
> any core that can access the cache can access all resource groups associated
> with that cache. On the other hand, you mention "on A64FX each core only uses
> one sector at a time" - this now sounds dynamic and is how resctrl works since
> the CPU is assigned a single class of service to indicate all resources
> accessible to it.

That is correct. Each core can be assigned to any resource group, and
each core only uses one sector at a time. Additionally, which sector
each core uses depends on the resource group (class of service) ID.
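
To confirm my understanding in code: a rough sketch of making the
default sector ID follow the CLOSID when a task is scheduled in could
look like the following. The system register name, its encoding and the
field mask are my assumptions based on the A64FX HPC extensions
document, not final code:

#include <linux/bitfield.h>
#include <asm/sysreg.h>

/*
 * Sketch only: make this CPU's default sector ID track the CLOSID of
 * the incoming task. The register encoding and field mask below are
 * assumptions, not a final implementation.
 */
#define SYS_IMP_SCCR_ASSIGN_EL1	sys_reg(3, 0, 11, 8, 1)	/* assumed encoding */
#define SCCR_DEFAULT_SECTOR	GENMASK_ULL(1, 0)	/* assumed field */

static void a64fx_set_default_sector(u32 closid)
{
	u64 val = read_sysreg_s(SYS_IMP_SCCR_ASSIGN_EL1);

	val &= ~SCCR_DEFAULT_SECTOR;
	val |= FIELD_PREP(SCCR_DEFAULT_SECTOR, (u64)closid);
	write_sysreg_s(val, SYS_IMP_SCCR_ASSIGN_EL1);
}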

> > - Disable A64FX's HPC tag address override function. We only set each
> >   core's default sector value according to the CLOSID (default sector
> >   ID = CLOSID).
> > - No L1 cache control, since the L1 cache is not shared between cores.
> >   It is not necessary to add an L1 cache interface to the schemata file.
> > - No need to update the schemata interface. Resctrl's L2 cache interface
> >   (L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...)
> >   will be used as it is. However, on A64FX, <cbm> does not indicate
> >   the position of the cache partition; it only indicates the number of
> >   cache ways (size).
>
> From what I understand the upcoming MPAM support would make this easier
> to do.
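
To make the schemata item above concrete: with four L2 cache instances,
a resource group's schemata file on A64FX would contain a single line
such as (mask values here are hypothetical)

  L2:0=ff;1=ff;2=ff;3=ff

where each <cbm> would be interpreted only as a number of ways (size),
not as a way position.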
>
> >
> > This is the smallest first step in incorporating the sector cache
> > function into resctrl. I will consider whether we could add more
> > sector cache features into resctrl (e.g. selecting different sectors
> > from within one task) after finishing this.
> >
> > (some questions are below)
> >
> >>>
> >>>> On 5/17/2021 1:31 AM, tan.shaopeng@xxxxxxxxxxx wrote:
> >>
> >>> --------
> >>> A64FX NUMA-PE-Cache Architecture:
> >>> NUMA0:
> >>>   PE0:
> >>>     L1sector0,L1sector1,L1sector2,L1sector3
> >>>   PE1:
> >>>     L1sector0,L1sector1,L1sector2,L1sector3
> >>>   ...
> >>>   PE11:
> >>>     L1sector0,L1sector1,L1sector2,L1sector3
> >>>
> >>>   L2sector0,1/L2sector2,3
> >>> NUMA1:
> >>>   PE0:
> >>>     L1sector0,L1sector1,L1sector2,L1sector3
> >>>   ...
> >>>   PE11:
> >>>     L1sector0,L1sector1,L1sector2,L1sector3
> >>>
> >>>   L2sector0,1/L2sector2,3
> >>> NUMA2:
> >>>   ...
> >>> NUMA3:
> >>>   ...
> >>> --------
> >>> In the A64FX processor, each L1 sector cache capacity setting
> >>> register is private to one PE and not shared among PEs. The L2
> >>> sector cache maximum capacity setting registers are shared among
> >>> PEs in the same NUMA node, and note that changing these registers
> >>> on one PE influences the other PEs.
> >>
> >> Understood. Cache affinity is familiar to resctrl. When a CPU comes
> >> online it is discovered which caches/resources it has affinity to.
> >> Resources then have a CPU mask associated with them to indicate on
> >> which CPU a register could be changed to configure the
> >> resource/cache. See domain_add_cpu() and struct rdt_domain.
> >
> > Is the following understanding correct?
> > Struct rdt_domain is a group of online CPUs that share the same cache
> > instance. When a CPU comes online (resctrl initialization), the
> > domain_add_cpu() function adds the online CPU to the corresponding
> > rdt_domain (in the rdt_resource::domains list). For example, if there
> > are 4 L2 cache instances, then there will be 4 rdt_domain entries in
> > the list and each CPU is assigned to the corresponding rdt_domain.
>
> Correct.
>
> >
> > The configured cache/memory values are stored in the ctrl_val array
> > (indexed by CLOSID) of struct rdt_domain. For example, in the CAT
> > function, the CBM value of CLOSID=x is stored in ctrl_val[x].
> > When we create a resource group and write cache settings into the
> > schemata file, the update_domains() function updates the CBM value in
> > ctrl_val[CLOSID of the resource group] in rdt_domain and writes the
> > CBM value to the CBM register (MSR_IA32_Lx_CBM_BASE).
>
> For the most part, yes. The only part that I would like to clarify is that
> each CLOSID is represented by a different register; which register is
> updated depends on which CLOSID is changed. They could be written as
> MSR_IA32_L2_CBM_CLOSID/MSR_IA32_L3_CBM_CLOSID. The "BASE"
> register is CLOSID 0, the default, and the other registers are determined
> as offsets from it.
>
> Also, the registers have the scope of the resource/cache. So, for example, if
> CPU 0 and CPU 1 share a L2 cache then it is only necessary to update the
> register on one of these CPUs.

Thanks for your explanation. I understood it.
In addition, the A64FX's L2 cache setting registers have a similar
resource/cache scope, so it is only necessary to update the register on
one of the CPUs that share the cache.
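
For reference, this is my understanding of the existing x86 flow in
code form: update_domains() picks any one CPU from each affected
domain's cpu_mask and lets it write the per-CLOSID register via IPI.
This is a simplified paraphrase, not the exact kernel source, and it
assumes resctrl's internal types (struct rdt_resource, struct
rdt_domain, struct msr_param, rdt_ctrl_update()); I expect an A64FX
back end to pick one PE per NUMA node in the same way:

/*
 * Simplified paraphrase of update_domains(): for each domain (cache
 * instance), pick any one CPU sharing the cache and send it an IPI to
 * write the control register of the changed CLOSID.
 */
static int update_domains_sketch(struct rdt_resource *r, int closid)
{
	struct msr_param msr_param = {
		.res  = r,
		.low  = closid,
		.high = closid + 1,
	};
	cpumask_var_t cpu_mask;
	struct rdt_domain *d;

	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
		return -ENOMEM;

	/* One representative CPU per cache instance is enough. */
	list_for_each_entry(d, &r->domains, list)
		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);

	/* rdt_ctrl_update() runs on one CPU per domain and writes the MSR. */
	on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);

	free_cpumask_var(cpu_mask);
	return 0;
}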

> >>> The number of ways for an L2 sector ID (0,1 or 2,3) can be set
> >>> through any PE in the same NUMA node. Sector IDs 0,1 and 2,3 are
> >>> not available at the same time in the same NUMA node.
> >>>
> >>>
> >>> I think, in your idea, a resource group will be created for each sector ID.
> >>> (> "sectors" could be considered the same as the resctrl "classes of
> >>> service") Then, an example resource group would be created as follows.
> >>> ・ L1: NUMAX-PEY-L1sector0 (X = 0,1,2,3; Y = 0,1,2...11),
> >>> ・ L2: NUMAX-L2sector0 (X = 0,1,2,3)
> >>>
> >>> In this example, the sector with the same ID (0) on all PEs is
> >>> allocated to the resource group. The L1D caches are numbered from
> >>> NUMA0_PE0-L1sector0(0) to NUMA3_PE11-L1sector0(47) and the L2 caches
> >>> are numbered from NUMA0-L2sector0(0) to NUMA3-L2sector0(3).
> >>> (NUMA number X is from 0-3, PE number Y is from 0-11)
> >>> (1) The number of ways of NUMAX-PEY-L1sector0 can be set
> >>>     independently for each PE (0-47). When running a task on this
> >>>     resource group, we cannot control which PE the task runs on or
> >>>     how many cache ways the task uses.
> >>
> >> resctrl does not control the affinity of which PE/CPU a task runs on.
> >> resctrl is an interface with which to configure how resources are
> >> allocated on the system. resctrl could thus provide an interface with
> >> which each sector of each cache instance is assigned a number of cache
> >> ways. resctrl also provides an interface to assign a task a class of
> >> service (sector id?). Through this the task obtains access to all
> >> resources that are allocated to the particular class of service
> >> (sector id?). Depending on which CPU the task is running on, it may
> >> indeed experience different performance if the sector id it is
> >> running with does not have the same allocations on all cache instances.
> >> The affinity of the task needs to be managed separately using, for
> >> example, taskset. Please see Documentation/x86/resctrl.rst "Examples
> >> for RDT allocation usage".
> >
> > In resctrl_sched_in(), there are comments as follows:
> > /*
> >  * If this task has a closid/rmid assigned, use it.
> >  * Else use the closid/rmid assigned to this cpu.
> >  */
> > I thought that when we write a PID to the tasks file, this task (PID)
> > will only run on the CPUs which are specified in the cpus file of the
> > same resource group. So, the task_struct's closid and the CPU's closid
> > are the same. When is a task's closid different from the CPU's closid?
>
> resctrl does not manage the affinity of tasks.
>
> Tony recently summarized the cpus file very well to me: the actual
> semantics of the CPUs file is to associate a CLOSid with a task that is in
> the default resctrl group, while it is running on one of the listed CPUs.
>
> To answer your question: the task's closid could be different from the
> CPU's closid if the task's closid is 0 while it is running on a CPU that
> is in the cpus file of a non-default resource group.
>
> You can see a summary of the decision flow in section "Resource allocation
> rules" in Documentation/x86/resctrl.rst
>
> The "cpus" file was created in support of the real-time use cases. In these use
> cases a group of CPUs can be designated as supporting the real-time work and
> with their own resource group and assigned the needed resources to do the
> real-time work. A real-time task can then be started with affinity to those CPUs
> and dynamically any kernel threads (that will be started on the same CPU)
> doing work on behalf of this task would be able to use the resources set aside
> for the real-time work.

Thanks for your explanation. I understood it.
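
To make sure I follow the decision flow, here is my paraphrase of the
__resctrl_sched_in() logic (simplified from the x86 code as I read it;
the names come from the existing implementation, but treat this as a
sketch, not the exact source):

/*
 * Paraphrase of __resctrl_sched_in(): a nonzero per-task closid wins;
 * otherwise the CPU's default closid (set through the "cpus" file of a
 * resource group) is used. Assumes the per-CPU pqr_state and the
 * rdt_alloc_enable_key static key from the existing resctrl code.
 */
static void resctrl_sched_in_sketch(void)
{
	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
	u32 closid = state->default_closid;

	if (static_branch_likely(&rdt_alloc_enable_key)) {
		if (current->closid)
			closid = current->closid;
	}

	if (closid != state->cur_closid) {
		state->cur_closid = closid;
		wrmsr(IA32_PQR_ASSOC, state->cur_rmid, closid);
	}
}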

I will implement this sector function, and if I have other questions,
please allow me to mail you.

Best regards,
Tan Shaopeng