[Bug] x86/resctrl: unexpected mbm_local_bytes/mbm_total_bytes delta on AMD with multiple RMIDs in the same domain

From: Hc Zheng
Date: Tue Jul 29 2025 - 03:55:39 EST


Hi All,

We have enabled resctrl on our container platform and noticed some
unexpected behavior when multiple containers run in the same L3 domain.
mbm_local_bytes/mbm_total_bytes for such mon_groups either return
Unavailable, or the delta between two consecutive reads is far outside
the normal range (e.g. 1000+ GB/s).

I read the AMD PQoS manual, which says:
"""
Potential causes of the “U” bit being set include
(but are not limited to):

• RMID is not currently tracked by the hardware.
• RMID was not tracked by the hardware at some time since it was last read.
• RMID has not been read since it started being tracked by the hardware.
"""

However, it gives no explanation for an unexpectedly large delta between
two reads of the counters. After examining the kernel code, I suspect
this is more likely a hardware bug.
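
For context on the "U" bit quoted above, here is a minimal sketch of how
a raw counter value might be decoded. It assumes the layout used by the
kernel's resctrl code (bit 63 = Error, bit 62 = Unavailable, remaining
bits = count). The helper name decode_qm_ctr is hypothetical, and the
count is in raw hardware units, not bytes.

```shell
# Hypothetical helper: classify a raw 64-bit counter value.
# Assumed layout: bit 63 = Error, bit 62 = Unavailable ("U"), bits 61:0 = count.
decode_qm_ctr() (
    raw=$(( $1 ))
    if [ $(( (raw >> 63) & 1 )) -eq 1 ]; then
        echo "error"
    elif [ $(( (raw >> 62) & 1 )) -eq 1 ]; then
        echo "unavailable"
    else
        # Mask off the two status bits and print the count.
        echo $(( raw & 0x3FFFFFFFFFFFFFFF ))
    fi
)
```

When the U bit is set, resctrl reports the string Unavailable instead of
a number, which matches the output shown in this report.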

Here are the steps to reproduce it:

1. Create the mon_groups:

$ for i in `seq 0 99`;do mkdir -p /sys/fs/resctrl/amdtest/mon_groups/test$i;done

2. Run the stress command and assign its PID to each mon_group (I ran
this test on AMD Genoa; CPUs 16-23,208-215 are on CCD 8):

$ cat stress.sh
nohup numactl -C 16-23,208-215 stress -m 1 --vm-hang 1 > /dev/null &
lastPid=$!
echo $lastPid > /sys/fs/resctrl/amdtest/tasks
echo $lastPid > /sys/fs/resctrl/amdtest/mon_groups/test$1/tasks
$ for i in `seq 0 99`;do bash stress.sh $i ;done

3. Watch the resctrl counter every 10 seconds:

$ while true; do cat /sys/fs/resctrl/amdtest/mon_groups/test9/mon_data/mon_L3_08/mbm_local_bytes; sleep 10; done

...
Unavailable
Unavailable
Unavailable
61924495182825856
64176294690029568
Unavailable
Unavailable
Unavailable
...
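
Since the raw counter values are hard to compare by eye, a small helper
can print the delta between consecutive reads and skip non-numeric
results. This is only a sketch; the names is_number and mbm_delta are
hypothetical, and the example loop in the comments reuses the path and
the 10 second interval from the command above.

```shell
# Return success only if the argument is a non-empty string of digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the byte delta between two readings, or "skip" when either
# reading is not a plain number (e.g. "Unavailable").
mbm_delta() {
    if is_number "$1" && is_number "$2"; then
        echo $(( $2 - $1 ))
    else
        echo "skip"
    fi
}

# Example loop (kept as comments so that sourcing this file does not block):
# f=/sys/fs/resctrl/amdtest/mon_groups/test9/mon_data/mon_L3_08/mbm_local_bytes
# prev=$(cat "$f")
# while sleep 10; do cur=$(cat "$f"); mbm_delta "$prev" "$cur"; prev=$cur; done
```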

At some point the delta between two consecutive reads is far outside the
normal range: (64176294690029568 - 61924495182825856) / 1024 / 1024 /
1024 / 10 = 209715 GB/s
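
The arithmetic above can be reproduced with shell integer math. This is
a minimal sketch; mbm_rate_gib is a hypothetical helper name, and the
result is truncated to whole units of 1024^3 bytes per second.

```shell
# Hypothetical helper: average rate between two mbm_*_bytes readings
# taken $3 seconds apart, in units of 1024^3 bytes per second.
# Uses integer division, so the result is rounded down.
mbm_rate_gib() {
    echo $(( ($2 - $1) / 1024 / 1024 / 1024 / $3 ))
}

# The two readings from this report, 10 seconds apart:
mbm_rate_gib 61924495182825856 64176294690029568 10   # prints 209715
```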

If I lower the concurrency to 59 or fewer, the delta stays in the normal
range and reads never return Unavailable. I have also tested on an AMD
Rome CPU, and the problem exists there as well.
I have tried this on an Intel platform, and it does not have this
problem, even with more than 200 RMIDs being monitored concurrently.
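
To see how the failure rate changes with the number of groups, one
option is to count how many mon_groups currently report Unavailable for
a given domain. This is a sketch only: count_unavailable is a
hypothetical helper, and the default domain name mon_L3_08 is taken from
the reproduction steps above.

```shell
# Hypothetical helper: count mon_groups under $1 whose mbm_local_bytes
# for the given domain (default: mon_L3_08) currently reads "Unavailable".
# The body runs in a subshell so its variables do not leak to the caller.
count_unavailable() (
    root=$1
    domain=${2:-mon_L3_08}
    n=0
    for f in "$root"/*/mon_data/"$domain"/mbm_local_bytes; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f" 2>/dev/null)" = "Unavailable" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)

# Example usage with the groups created in step 1:
# count_unavailable /sys/fs/resctrl/amdtest/mon_groups
```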

I cannot find any documentation on the maximum number of RMIDs that AMD
hardware can track concurrently, or an explanation for this problem.
I believe this could become even more severe on AMD as thread counts
grow, since we will run more workloads on a single server.

Can someone help me solve this problem? Thanks.


Best Regards
Huaicheng Zheng