Re: Region aware RDT options for resctrl
From: Luck, Tony
Date: Fri Apr 11 2025 - 16:56:36 EST
On Fri, Apr 11, 2025 at 01:54:12PM -0700, Luck, Tony wrote:
Add Cc: lkml
> A future CPU from Intel will implement "region aware" memory bandwidth
> monitoring and bandwidth allocation. This will provide for more granular
> monitoring and control for heterogeneous memory configurations. BIOS
> will populate an ACPI table that describes which system physical address
> ranges belong to each region. E.g. on a two-socket system with both
> DDR and CXL memory, regions could be assigned like this:
>
> Region 0: Local DDR
> Region 1: Remote DDR
> Region 2: Local CXL
> Region 3: Remote CXL
>
> Details of the ACPI tables and MMIO registers are in the "Intel(R)
> Resource Director Technology Architecture Specification" here:
> https://cdrdv2.intel.com/v1/dl/getContent/789566
>
> The existing Linux resctrl user interface will need some extensions
> to handle these new hardware monitors and controls. Here are some
> options for discussion, with the goal of aligning on a user interface
> that meets the current and near-future needs of all architectures.
>
> Memory bandwidth monitoring
> ---------------------------
>
> The existing interface provides two files in each of the per-domain
> directories under "mon_data":
>
> mbm_local_bytes: Count of bytes transferred to/from "local" memory
> mbm_total_bytes: Count of bytes transferred to/from all memory
>
> Proposal is to provide a new file to report traffic for each region
> for however many regions are implemented on a system:
>
> mbm_region_0_bytes
> ...
> mbm_region_N_bytes
>
> Potentially a compatibility file:
>
> mbm_total_bytes
>
> could be included which provides data for the sum across all regions.
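A minimal sketch of how userspace could compute the same sum on its own, assuming the proposed mbm_region_N_bytes files exist in a per-domain mon_data directory and each holds a single decimal byte count. The helper name and the directory argument are illustrative assumptions, not an existing interface.

```python
import re
from pathlib import Path

# Matches the file names proposed above, e.g. mbm_region_0_bytes.
REGION_FILE = re.compile(r"^mbm_region_(\d+)_bytes$")

def sum_region_bytes(domain_dir):
    """Return (per_region, total) for the proposed per-region counter files."""
    per_region = {}
    for entry in Path(domain_dir).iterdir():
        m = REGION_FILE.match(entry.name)
        if not m:
            continue
        text = entry.read_text().strip()
        # Existing resctrl counters can read "Unavailable" or "Error";
        # this sketch simply skips anything that is not a plain number.
        if text.isdigit():
            per_region[int(m.group(1))] = int(text)
    return per_region, sum(per_region.values())
```

A kernel-provided mbm_total_bytes would still be preferable where a consistent snapshot matters, since reading N files one after another is not atomic.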
>
> Providing a similar mbm_local_bytes file would be challenging as the
> BIOS controls the region numbering and it may be difficult/impossible
> for Linux to determine which regions report "local" memory traffic.
> A future implementation may allow the OS to define the region mapping
> which makes things even more complex as the mappings could be changed
> at run time.
>
> Memory bandwidth allocation
> ---------------------------
>
> This is more complex as there are some capability improvements
> in addition to providing separate controls for each region. Resctrl
> already has support to control bandwidth to "slow" memory on AMD systems
> providing separate controls for "regular" and "slow" memory in the schemata file:
>
> $ cat schemata
> MB: 0=100;1=100
> SMBA:0=100;1=100
>
> It would be tricky for resctrl to build on this for regions for the same
> reason that mbm_local_bytes would be difficult: Linux has no way to determine
> which regions are CXL vs. DDR. This approach would also lose the ability to
> control local vs. remote bandwidth, and it is not extensible to future memory
> configuration options.
>
> Option 1: Per-memory regions might be described individually like this:
>
> $ cat schemata
> RMB0:0=100;1=100
> RMB1:0=75;1=75
> RMB2:0=25;1=25
>
> Option 2: Extend the per-line schemata syntax to keep one line, but specify each region
> in a comma-separated list:
>
> $ cat schemata
> RMB:0=100,75,50,25;1=100,75,50,25
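To compare the two syntax options, here is a rough sketch of parsers that reduce both to the same (region, domain) -> value mapping. The function names are hypothetical, and the grammar is only what can be inferred from the examples above; nothing here reflects an agreed-upon format.

```python
import re

def parse_per_region_lines(text):
    """Option 1: one 'RMB<region>:<dom>=<val>;...' line per region."""
    limits = {}  # (region, domain) -> value
    for line in text.splitlines():
        m = re.match(r"^\s*RMB(\d+)\s*:(.*)$", line)
        if not m:
            continue  # ignore other resources such as MB or L3
        region = int(m.group(1))
        for item in m.group(2).split(";"):
            dom, val = item.split("=")
            limits[(region, int(dom))] = int(val)
    return limits

def parse_list_line(line):
    """Option 2: a single 'RMB:<dom>=<v0>,<v1>,...;...' line."""
    name, _, rest = line.partition(":")
    if name.strip() != "RMB":
        raise ValueError("not an RMB line")
    limits = {}
    for item in rest.split(";"):
        dom, vals = item.split("=")
        # The position in the comma-separated list is taken as the region number.
        for region, val in enumerate(vals.split(",")):
            limits[(region, int(dom))] = int(val)
    return limits
```

With Option 2 the region number is implicit in the list position, so changing one region's limit means rewriting the whole list (or defining some skip syntax), while Option 1 lets a single region be updated by writing a single line.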
>
> But there are additional capabilities that would be useful to expose that
> may influence decisions.
>
> 1) Better than 1% throttle granularity
>
> Existing Intel implementations provide throttle controls in 10% steps. The
> architectural enumeration allows for at best 1% steps. But this may still be
> inadequate to provide distinct controls when very high levels of throttling
> are needed for low priority workloads. The RDT architecture specification
> allows for bandwidth limits to be specified from 1 (maximum throttle) to 511
> (no throttle) though implementations may provide other ranges, e.g. 1..255.
>
> Option 1: Specify bandwidth in schemata with floating point values
>
> $ cat info/MB/min_bandwidth
> 0.1957
> $ cat info/MB/bandwidth_gran
> 0.1957
> $ cat schemata
> RMB0:0=100;1=100
> RMB1:0=0.75;1=1.25
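The 0.1957 in this example matches 100/511, i.e. one step of a 1..511 control expressed as a percentage. A small sketch of the conversion, assuming a linear mapping between percentage and the raw control value (the linear mapping and the function names are assumptions made for illustration, not taken from the specification):

```python
MAX_RAW = 511  # "no throttle" value from the range quoted above

def percent_to_raw(percent, max_raw=MAX_RAW):
    """Map a percentage to the nearest raw control value, clamped to 1..max_raw."""
    raw = round(percent * max_raw / 100)
    return min(max(raw, 1), max_raw)

def raw_to_percent(raw, max_raw=MAX_RAW):
    """Map a raw control value back to the percentage it represents."""
    return raw * 100 / max_raw

# One raw step is 100/511 percent.
print(round(raw_to_percent(1), 4))
# Requested values are rounded to the nearest step.
print(percent_to_raw(0.75), percent_to_raw(1.25), percent_to_raw(100))
```

Under this assumption a value written to schemata would generally read back slightly different from what was written (0.75 would become roughly 0.78), which is one argument for exposing the raw range directly as in the enumerated-range option.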
>
> Option 2: Change from "percentage" to some enumerated range
>
> $ cat schemata
> RMB0:0=511;1=511
>
> 2) Min/max ranges for bandwidth
>
> When a single fixed value for bandwidth limits is provided, users are
> forced to be overly conservative when assigning limits in the schemata
> file in order to keep memory controllers within capacity limits. This
> can result in jobs being throttled unnecessarily at times when there is
> plenty of bandwidth capacity available.
>
> The latest RDT architecture specification allows for setting a minimum
> and maximum bandwidth in addition to the normal limit. Example usage
> would be to set a higher maximum value for low priority jobs to allow
> them to run faster when the system has available memory bandwidth capacity.
> High priority jobs can have a minimum bandwidth setting so that when
> the system is running close to capacity limits, those jobs are not
> throttled as much (or at all) while lower priority jobs are throttled.
>
> Syntax option:
>
> $ cat schemata
> RMB0:0=25<50<100;1=25<50<100
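A sketch of how a tool might parse this syntax, assuming the three fields are ordered minimum<limit<maximum and that a bare number remains a valid plain limit. Both points are assumptions read off the example, and the type and function names are hypothetical.

```python
from typing import NamedTuple

class BwRange(NamedTuple):
    minimum: float
    limit: float
    maximum: float

def parse_bw_range(value):
    """Parse 'min<limit<max', or a bare number treated as a plain limit."""
    parts = [float(p) for p in value.split("<")]
    if len(parts) == 1:
        # No range given: minimum and maximum collapse onto the limit.
        return BwRange(parts[0], parts[0], parts[0])
    if len(parts) != 3:
        raise ValueError(f"expected min<limit<max, got {value!r}")
    lo, limit, hi = parts
    if not lo <= limit <= hi:
        raise ValueError(f"values out of order in {value!r}")
    return BwRange(lo, limit, hi)

def parse_schemata_line(line):
    """Split 'NAME:dom=spec;dom=spec' into (name, {domain: BwRange})."""
    name, _, rest = line.partition(":")
    domains = {}
    for item in rest.split(";"):
        dom, spec = item.split("=")
        domains[int(dom)] = parse_bw_range(spec)
    return name.strip(), domains
```

Values are parsed as floats so that the same code also accepts fractional settings such as 2.5<30<40.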
>
> Combining some of these options for new capabilities we could have:
>
> $ cat schemata
> RMB0:0=25<50<100;1=25<50<100
> RMB1:0=2.5<30<40;1=2.5<30<40
> RMB2:0=80<90<100;1=80<90<100
>
> -Tony