Re: [PATCH v4 1/6] x86/resctrl: Parse ACPI ERDT table and map RMDD domains by L3 cache ID

From: Reinette Chatre

Date: Thu Jun 18 2026 - 19:38:22 EST


Hi Chenyu,

On 6/13/26 12:56 AM, Chen Yu wrote:
> From: Anil S Keshavamurthy <anil.s.keshavamurthy@xxxxxxxxx>
>
> ERDT(Enhanced RDT) introduces a new top-level ACPI structure

(nit: please add space between ERDT and the "(")

> (the ERDT) that the kernel must parse before any enhanced
> RDT feature can be used. The ERDT improves the existing RDT
> by switching low-level register access from MSR-based to
> MMIO-based, which is more efficient.

(nit: please use full line length available for changelog)

Above states clearly "switching low-level register access from MSR-based to
MMIO-based". This implementation clearly does so ... but the new ACPI tables
supporting the MMIO-based access also include, for example, the
maximum RMIDs that determines how many MMIO registers are supported.
Even so, from what I can tell, the "maximum RMID" that the ACPI tables
report is ignored and this implementation instead continues to rely on
the maximum RMID reported via CPUID that indicates the maximum RMIDs supported
by the *MSR* to also guide the access of the MMIO registers.

This implementation thus looks to straddle some middle ground between
ACPI and CPUID enumeration, using return of one to guide access to the
other that does not seem robust and requires platforms to forever support both.


> The ERDT structure may include several sub ACPI tables:
>
> - Resource Management Domain Description Structure (RMDD)
> - CPU Agent Collection Description Structure (CACD)
> - Cache Monitoring Registers for CPU Agents Description Structure
> (CMRC)
>
> There is one ERDT table per platform.
> Each RMDD substructure in ERDT represents one resource management
> domain (RMD), also known as an L3 domain. Thus, the total number
> of RMDDs equals the number of L3 domains on the platform.
> Each RMDD contains information such as MMIO addresses. This address
> is used to retrieve RDT metrics like L3 occupancy.
>
> Add basic ERDT ACPI table and sub-table parsing, and store the
> relevant tables for later processing.
>
> Among these sub-tables, RMDD requires special handling. There is one
> RMDD per domain, and the domain ID reuses the L3 cache ID. Many code
> paths need to retrieve an RMDD efficiently by domain ID (L3 cache ID).
> Because L3 cache IDs are derived from x2APIC IDs and are not
> contiguous, using a plain array indexed by domain ID would waste
> memory. As a trade-off, an xarray is used to store these tables, with

(nit: needs imperative eg. "use an xarray ...")

> the L3 cache ID as the key.
>
> Suggested-by: Tony Luck <tony.luck@xxxxxxxxx>
> Tested-by: Hongyu Ning <hongyu.ning@xxxxxxxxxxxxxxx>
> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@xxxxxxxxx>
> ---
...

> ---
> arch/x86/Kconfig | 4 +-
> arch/x86/include/asm/apic.h | 1 +
> arch/x86/include/asm/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/Makefile | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 14 +-
> arch/x86/kernel/cpu/resctrl/erdt.c | 296 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +
> arch/x86/kernel/cpu/topology.c | 2 +-

Please split out the non-resctrl changes into separate patch with sybsystem-specific
prefix to highlight these changes to not appear to sneak changes into other
subsystems via resctrl. They can stil form part of this series as preparatory patches.

> 8 files changed, 319 insertions(+), 4 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/resctrl/erdt.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f3f7cb01d69d..97d210bd9bb5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -515,7 +515,7 @@ config X86_MPPARSE
>
> config X86_CPU_RESCTRL
> bool "x86 CPU resource control support"
> - depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> + depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> depends on MISC_FILESYSTEMS
> select ARCH_HAS_CPU_RESCTRL
> select RESCTRL_FS
> @@ -538,7 +538,7 @@ config X86_CPU_RESCTRL
>
> config X86_CPU_RESCTRL_INTEL_AET
> bool "Intel Application Energy Telemetry"
> - depends on X86_64 && X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
> + depends on X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
> help
> Enable per-RMID telemetry events in resctrl.
>

Removing 32bit support is a significant change worth a mention and motivation in the changelog.

> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index 9cd493d467d4..bb84651b14bd 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -54,6 +54,7 @@ static inline void x86_32_probe_apic(void) { }
> #endif
>
> extern u32 cpuid_to_apicid[];
> +int topo_lookup_cpuid(u32 apic_id);
>
> #define CPU_ACPIID_INVALID U32_MAX
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 575f8408a9e7..97c2f6bc7a5f 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -40,6 +40,8 @@ struct resctrl_pqr_state {
> u32 default_closid;
> };
>
> +bool erdt_enabled(void);
> +

I cannot see why this needs to be in asm header ... why not arch/x86/kernel/cpu/resctrl/internal.h?

> DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
>
> extern bool rdt_alloc_capable;
> diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
> index 273ddfa30836..2216ee084832 100644
> --- a/arch/x86/kernel/cpu/resctrl/Makefile
> +++ b/arch/x86/kernel/cpu/resctrl/Makefile
> @@ -2,6 +2,7 @@
> obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
> obj-$(CONFIG_X86_CPU_RESCTRL_INTEL_AET) += intel_aet.o
> +obj-$(CONFIG_X86_CPU_RESCTRL) += erdt.o
> obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
>
> # To allow define_trace.h's recursive include:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 7667cf7c4e94..90730f0851fa 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -1012,6 +1012,7 @@ static __init void check_quirks(void)
>
> static __init bool get_rdt_resources(void)
> {
> + erdt_init();
> rdt_alloc_capable = get_rdt_alloc_resources();
> rdt_mon_capable = get_rdt_mon_resources();
>
> @@ -1113,7 +1114,7 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c)
> }
> }
>
> -static int __init resctrl_arch_late_init(void)
> +static int __init __resctrl_arch_late_init(void)
> {
> struct rdt_resource *r;
> int state, ret, i;
> @@ -1156,6 +1157,15 @@ static int __init resctrl_arch_late_init(void)
> return 0;
> }
>
> +static int __init resctrl_arch_late_init(void)
> +{
> + int ret = __resctrl_arch_late_init();
> +
> + if (ret)
> + erdt_exit();
> + return ret;
> +}
> +
> late_initcall(resctrl_arch_late_init);
>
> static void __exit resctrl_arch_exit(void)
> @@ -1165,6 +1175,8 @@ static void __exit resctrl_arch_exit(void)
> cpuhp_remove_state(rdt_online);
>
> resctrl_exit();
> +
> + erdt_exit();
> }
>
> __exitcall(resctrl_arch_exit);
> diff --git a/arch/x86/kernel/cpu/resctrl/erdt.c b/arch/x86/kernel/cpu/resctrl/erdt.c
> new file mode 100644
> index 000000000000..b43dcc06f009
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/erdt.c
> @@ -0,0 +1,296 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Enhanced Resource Director Technology(ERDT)

(nit: please add space)

> + *
> + * Copyright (C) 2026 Intel Corporation
> + *
> + */
> +
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> +#include <linux/cleanup.h>
> +#include <linux/cpu.h>
> +#include <linux/err.h>
> +#include <linux/xarray.h>
> +#include <linux/resctrl.h>
> +#include <linux/acpi.h>

Please sort the includes alphabetically and add empty lines between the groups of headers.

> +#include <asm/apic.h>
> +#include <asm/cpu_device_id.h>

Why is above #include needed?

> +#include "internal.h"
> +


Please add a comment for the enum below to describe what the different members represent.

> +enum erdt_mmio_type {
> + ERDT_MMIO_RMDD_CREG,
> + ERDT_MMIO_CMRC_BASE,
> + ERDT_MMIO_MAX
> +};

Depending on how this enum ends up being used, could it benefit from a pattern like
https://lore.kernel.org/lkml/4a304e6c9ed3f5d92156b8045f5dd4ddba359dfc.1777419024.git.reinette.chatre@intel ?

> +
> +struct erdt_domain_info {
> + void __iomem *base[ERDT_MMIO_MAX];
> + struct acpi_erdt_cmrc *cmrc;
> +};
> +
> +static bool erdt_enabled_flag;

What is significance of "flag"? Can it just be "erdt_enabled"? Since it is a bool this seems
more intuitive?

> +
> +static DEFINE_XARRAY(erdt_domain_xa);

Please add comment to describe what is stored in erdt_domain_xa and what the
index represents.

> +
> +#define ERDT_VALID_VERSION 1
> +#define RMDD_FLAG_CPU_DOMAIN BIT(0)

I think this name can be more accurate, how about, for example "RMDD_FLAG_L3_DOMAIN"?

> +
> +static u32 valid_subtbl_mask;

Please add comment to describe what above global represents.

> +
> +bool erdt_enabled(void)
> +{
> + return erdt_enabled_flag;
> +}
> +
> +/**
> + * get_l3_cache_id_from_cacd - Resolve L3 cache ID from CACD subtable

As I interpret it ... above does not describe to be what this function does.

> + * @cacd: Pointer to the ACPI ERDT CACD structure
> + *
> + * Parses the X2APIC ID list in the given CACD subtable to
> + * identify an online logical CPU and uses it to query the associated
> + * L3 cache ID. The first valid CPU found is used for this lookup.

This function is __init and requires at least one CPU per domain to be online.
What will happen if a system is booted with all CPUs of a domain offline and
then later a CPU is online'd?

> + *
> + * The L3 cache ID is used as a unique domain key for ERDT domain
> + * registration and lookup.
> + *
> + * Return: L3 cache ID for the first matching CPU, or -1 on failure.

Could you please use consistent terminology? Above comments switches between
"online logical CPU" and "valid CPU" and "matching CPU" that all seem to refer
to the same idea.

> + */
> +static __init int get_l3_cache_id_from_cacd(struct acpi_erdt_cacd *cacd)
> +{
> + int num_ids, cpu, online_cpu = -1, cache_id = -1, tmp;
> + struct cacheinfo *ci;
> +
> + if (cacd->header.length < sizeof(*cacd) + sizeof(cacd->X2APICIDS[0])) {

Is above an open code of struct_size()?

> + pr_warn(FW_BUG "Invalid x2apicid CACD table\n");

This patch displays several "FW_BUG" messages but they are sometimes printed with
pr_info() and sometimes with pr_warn(). Is this intentional?

> + return -1;
> + }
> +
> + num_ids = (cacd->header.length - sizeof(*cacd)) / sizeof(cacd->X2APICIDS[0]);
> +
> + guard(cpus_read_lock)();
> +
> + for (int i = 0; i < num_ids; i++) {
> + cpu = topo_lookup_cpuid(cacd->X2APICIDS[i]);
> + if (cpu < 0) {
> + pr_warn(FW_BUG "Unknown x2apicid 0x%x\n", cacd->X2APICIDS[i]);
> +
(nit: unnecessary empty line)


> + return -1;
> + }
> +
> + if (!cpu_online(cpu))
> + continue;
> +
> + tmp = get_cpu_cacheinfo_id(cpu, RESCTRL_L3_CACHE);

>From what I can tell it is tmp that is essentially returned by this function and it
is the result of get_cpu_cacheinfo_id() that returns the cache ID initialized from
CPUID leaf 0x4 ... the function name ("get_l3_cache_id_from_cacd()") and function comment
"Resolve L3 cache ID from CACD subtable" implies differently and the way this is done
makes me wonder why CACD needs to be parsed at all.

resctrl already has a hotplug handler and it dynamically determines the domain ID
based on the cache ID returned from get_cpu_cacheinfo_id(). If the domain ID used by the
new ERDT tables can be, as it appears is done here, trusted to what Linux already
treating as domain ID then why is it needed to parse CACD at all? Why not just
pick the domain ID from RMDD directly and use it directly as domain ID?

Later the domain ID from RMDD is just used in error messages ... never actually used
as a domain ID to guide the implementation so the correlation appears to be implicit with
this CACD parsing not intuitive.

> + if (tmp == -1) {
> + pr_warn(FW_BUG "Can not find L3 cache id for CPU%d\n", cpu);
> + return -1;
> + }
> +
> + if (cache_id == -1)
> + cache_id = tmp;
> +
> + if (tmp != cache_id) {
> + pr_warn(FW_BUG "CACD references multiple L3 cache instances\n");
> + return -1;
> + }
> + online_cpu = cpu;
> + }
> +
> + if (online_cpu == -1)
> + return -1;
> +
> + /*
> + * Check if CACD lists all CPUs in the LLC domain.
> + */
> + ci = get_cpu_cacheinfo_level(online_cpu, RESCTRL_L3_CACHE);
> + if (!ci || num_ids < cpumask_weight(&ci->shared_cpu_map)) {
> + pr_warn(FW_BUG "CACD does not list all the CPUs in L3 domain\n");
> + return -1;
> + }
> +
> + return cache_id;
> +}
> +
> +static void __iomem *erdt_ioremap_checked(phys_addr_t base, u32 size, const char *desc)

"size" -> "num_pages"?

> +{
> + void __iomem *addr = ioremap(base, size << 12);

The 12 magic constant is not ideal. Seems like PAGE_SHIFT is avoided since this *has* to be
4K? Would "num_pages * SZ_4K" be more accurate? To be most robust please consider check_mul_overflow()
or check_shl_overflow().

It is not clear to me what the "_checked" of the function name implies but I do not see
the parameters checked. For example, base could be checked for 4KB alignment?

> +
> + if (!addr) {
> + pr_err("ERDT: Failed to map %s at phys addr %#llx (size: %u pages)\n",

phys_addt_t is printed with %pa

> + desc, (unsigned long long)base, size);
> + }
> + return addr;
> +}
> +
> +static void erdt_iounmap_domain(struct erdt_domain_info *domain)
> +{
> + for (int i = 0; i < ERDT_MMIO_MAX; i++) {
> + if (domain->base[i]) {
> + iounmap(domain->base[i]);
> + domain->base[i] = NULL;
> + }
> + }
> +}
> +
> +static void cleanup_one_domain(struct erdt_domain_info *d)
> +{
> + erdt_iounmap_domain(d);
> + kfree(d);
> +}
> +
> +static __init bool cacd_init(struct acpi_subtbl_hdr_16 *subtbl, int *l3_cache_id)
> +{
> + *l3_cache_id = get_l3_cache_id_from_cacd((struct acpi_erdt_cacd *)subtbl);
> +
> + return *l3_cache_id != -1;
> +}
> +
> +static inline struct acpi_subtbl_hdr_16 *rmdd_subtbl(struct acpi_erdt_rmdd *rmdd)
> +{
> + return (void *)rmdd + sizeof(*rmdd);
> +}
> +
> +static inline struct acpi_subtbl_hdr_16 *next_subtbl(struct acpi_subtbl_hdr_16 *subtbl)
> +{
> + return (void *)subtbl + subtbl->length;
> +}
> +
> +static inline bool subtbl_valid(void *end, struct acpi_subtbl_hdr_16 *subtbl)
> +{

(please see later comment where it seems that it may be possible that this function is
called with end == subtbl and the reference below would thus be past end of table)

> + if (subtbl->length < sizeof(*subtbl))
> + return false;
> +
> + if ((void *)subtbl + subtbl->length > end)
> + return false;
> +
> + return true;
> +}
> +
> +static __init bool parse_rmdd_entry(struct acpi_subtbl_hdr_16 *rmdd_hdr)
> +{
> + struct erdt_domain_info *domain_info;
> + struct acpi_subtbl_hdr_16 *subtbl;
> + struct acpi_erdt_rmdd *rmdd;
> + int l3_cache_id = -1;
> + u32 subtbl_mask = 0;
> +
> + if (rmdd_hdr->length < sizeof(*rmdd)) {
> + pr_info(FW_BUG "Invalid RMDD length %u\n", rmdd_hdr->length);
> + return false;
> + }
> +
> + rmdd = (struct acpi_erdt_rmdd *)rmdd_hdr;
> +
> + /* Quietly ignore non-CPU-based L3 domains */
> + if (!(rmdd->flags & RMDD_FLAG_CPU_DOMAIN))
> + return true;
> +
> + domain_info = kzalloc(sizeof(*domain_info), GFP_KERNEL);

Please use kzalloc_obj().
commit 2932ba8d9c99 ("slab: Introduce kmalloc_obj() and family") contains details about benefits.

> + if (!domain_info)
> + return false;
> +
> + domain_info->base[ERDT_MMIO_RMDD_CREG] =
> + erdt_ioremap_checked(rmdd->creg_base, rmdd->creg_size, "RMDD ctrl base");
> + if (!domain_info->base[ERDT_MMIO_RMDD_CREG])
> + goto cleanup;
> +
> + for (subtbl = rmdd_subtbl(rmdd);
> + subtbl_valid((void *)rmdd + rmdd->header.length, subtbl);

This does not look safe to me. What will happen if RMDD is empty? Would subtbl not
point to end of the table and then subtbl_valid() would dereference pointer beyond the
end of the table to do its checks?

> + subtbl = next_subtbl(subtbl)) {
> + switch (subtbl->type) {
> + case ACPI_ERDT_TYPE_CACD:
> + if (cacd_init(subtbl, &l3_cache_id))
> + subtbl_mask |= BIT(ACPI_ERDT_TYPE_CACD);
> + break;
> + default:
> + break;
> + }
> + }
> +
> + if (l3_cache_id == -1) {
> + pr_info("ERDT: Failed to resolve L3 cache ID for RMDD domain %d\n",
> + rmdd->domain_id);
> +
> + goto cleanup;
> + }
> +
> + /* Require all RMDDs to support same set of sub-tables */
> + if (!valid_subtbl_mask) {
> + valid_subtbl_mask = subtbl_mask;
> + } else if (subtbl_mask != valid_subtbl_mask) {
> + pr_info(FW_BUG "Mismatch domain\n");
> + goto cleanup;
> + }
> +
> + if (xa_insert(&erdt_domain_xa, l3_cache_id, domain_info, GFP_KERNEL)) {
> + pr_info("ERDT: Failed to store domain info for RMDD domain %d\n",
> + rmdd->domain_id);
> + goto cleanup;
> + }
> +
> + return true;
> +
> +cleanup:
> + cleanup_one_domain(domain_info);
> + return false;
> +}
> +
> +static void erdt_cleanup(void)
> +{
> + struct erdt_domain_info *d;
> + unsigned long index;
> +

Looks like the initialization via get_rdt_resources() does not do any error
checking so there may be nothing to clean up and certainly there will be nothing
to clean up if ERDT is not supported. Can the cleanup have a short circuit here to
only clean up if the array is not empty or perhaps skip cleanup if !erdt_enabled?

> + xa_for_each(&erdt_domain_xa, index, d)
> + cleanup_one_domain(d);
> + xa_destroy(&erdt_domain_xa);
> +}
> +
> +static __init int enumerate_erdt_table(struct acpi_table_header *table_hdr)
> +{
> + struct acpi_table_erdt *erdt = (struct acpi_table_erdt *)table_hdr;
> + struct acpi_subtbl_hdr_16 *subtbl;
> + void *table_end;
> +
> + if (erdt->header.revision != ERDT_VALID_VERSION) {
> + pr_info("Unknown ERDT table revision %d\n", erdt->header.revision);
(nit: "Unknown" -> "Unsupported"?)

> + return -EINVAL;
> + }
> +
> + if (erdt->header.length < sizeof(*erdt)) {
> + pr_info(FW_BUG "ERDT: Invalid table length %u\n", erdt->header.length);

Here is an example of a "pr_info()" FW_BUG. It may help to add a unit
and not just report "length"

> + return -EINVAL;
> + }
> +
> + subtbl = (void *)erdt + sizeof(struct acpi_table_erdt);
> + table_end = (void *)erdt + erdt->header.length;
> +
> + while (subtbl_valid(table_end, subtbl)) {
> + if (subtbl->type == ACPI_ERDT_TYPE_RMDD)
> + if (!parse_rmdd_entry(subtbl))
> + goto cleanup;
> +
> + subtbl = next_subtbl(subtbl);
> + }
> +
> + if (xa_empty(&erdt_domain_xa))
> + goto cleanup;
> +
> + erdt_enabled_flag = true;
> +
> + return 0;
> +
> +cleanup:
> + erdt_cleanup();
> + return -EINVAL;
> +}
> +
> +int __init erdt_init(void)
> +{
> + return acpi_table_parse(ACPI_SIG_ERDT, enumerate_erdt_table);
> +}
> +
> +void erdt_exit(void)
> +{
> + erdt_cleanup();
> +}

Why is erdt_exit() and erdt_cleanup() both needed?

Reinette