[Question] Core-to-core CAS latency spikes on AMD EPYC 8534P (Siena) cross-CCD

From: Ionut Nechita

Date: Wed Apr 29 2026 - 12:28:31 EST

Hi all,

I'm evaluating core-to-core CAS latency on an AMD EPYC 8534P (Siena,
Zen 4c, 64 physical cores, 4 CCDs, single socket) for RT workload
planning, and I'm observing sporadic latency spikes of 3-6us on
cross-CCD core pairs, against a baseline cross-CCD latency of
~250-280ns.

I'd like to understand whether this is expected Infinity Fabric behavior
under coherency contention, or if there's something on the software side
that could be contributing.

Hardware
========

CPU: AMD EPYC 8534P (Zen 4c, Siena)
Cores: 64C/128T, single socket, SP6
Topology: 4 CCDs, each with 2 CCXs of 8 cores
L3 cache: 16 MB per CCX (128 MB total)
    shared_cpu_list per L3: 8 cores + 8 HT siblings
    (e.g., cpu0 L3 shared with cores 0-7,64-71)
NUMA nodes: 4 (one per CCD)
    node0: cores 0-15, HT 64-79
    node1: cores 16-31, HT 80-95
    node2: cores 32-47, HT 96-111
    node3: cores 48-63, HT 112-127
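
To make the pair categories used in the results unambiguous, here is a
minimal sketch of how a logical CPU number maps onto the topology listed
above. It assumes the numbering described in this post (HT sibling =
core + 64, 8 cores per CCX, 16 per CCD); the names locate() and
classify() are made up for illustration.

```python
# Assumption: CPU numbering follows the layout described in this post
# (64 physical cores, HT sibling = core + 64, 8 cores per CCX, 16 per CCD).
CORES = 64
CORES_PER_CCX = 8
CORES_PER_CCD = 16


def locate(cpu):
    """Return (physical core, CCX index, CCD index) for a logical CPU 0-127."""
    if not 0 <= cpu < 2 * CORES:
        raise ValueError(f"cpu {cpu} out of range")
    core = cpu % CORES
    return core, core // CORES_PER_CCX, core // CORES_PER_CCD


def classify(cpu_a, cpu_b):
    """Name the closest level shared by two logical CPUs."""
    core_a, ccx_a, ccd_a = locate(cpu_a)
    core_b, ccx_b, ccd_b = locate(cpu_b)
    if core_a == core_b:
        return "same-core"  # identical CPU or HT siblings
    if ccx_a == ccx_b:
        return "intra-CCX"
    if ccd_a == ccd_b:
        return "cross-CCX"
    return "cross-CCD"


if __name__ == "__main__":
    for pair in [(0, 64), (0, 7), (0, 8), (16, 31), (122, 64)]:
        print(pair, classify(*pair))
```

On a live system the authoritative grouping is
/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list rather than
arithmetic on the CPU number.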

Kernel
======

Version: 6.12.57-1.stx.136 (PREEMPT_RT)
cpuidle: driver=none (idle uses mwait, C1 only)
cpufreq: acpi-cpufreq, performance governor, boost enabled (~3.1GHz)
SMI count: 0 on all cores (verified via turbostat)
cmdline: nohz_full=1-63,65-127 rcu_nocbs=1-63,65-127
    kthread_cpus=0,64 irqaffinity=1-63,65-127
    iommu=pt nopti nospectre_v2 nospectre_v1
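
For reference, the CPU sets implied by the cmdline can be expanded with
a small sketch like the one below. parse_cpulist() is a hypothetical
helper that handles only plain "a-b,c" lists, not the stride or "all"
forms the kernel also accepts, and the 128-CPU total is taken from the
hardware section.

```python
def parse_cpulist(text):
    """Expand a simple cpulist such as '1-63,65-127' into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


ALL_CPUS = set(range(128))  # 64C/128T, as listed under Hardware

nohz_full = parse_cpulist("1-63,65-127")
kthread_cpus = parse_cpulist("0,64")

# CPUs left outside nohz_full, i.e. the housekeeping set.
housekeeping = sorted(ALL_CPUS - nohz_full)

if __name__ == "__main__":
    print("nohz_full CPUs:", len(nohz_full))
    print("housekeeping:", housekeeping)
    print("kthread_cpus:", sorted(kthread_cpus))
```

With the values above, the housekeeping set comes out as core 0 and its
HT sibling 64, which matches kthread_cpus.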

Test tool
=========

core-to-core-latency (CAS on single shared cache line)
https://github.com/nviennot/core-to-core-latency
1000 iterations per sample, 300 samples

Results (all 128 threads active, taskset -c 0-127)
===================================================

Intra-CCX (same 8-core cluster): ~62-100ns
Cross-CCX (same CCD, 16 cores): ~230-270ns
Cross-CCD (typical): ~250-300ns
Cross-CCD (sporadic spikes): 3000-6600ns

Min latency: 19.4ns (HT sibling pair)
Max latency: 3056ns (cross-CCD, cores 122,64)
Mean: 254.8ns

Results (single CCD only, taskset -c 16-31)
============================================

Intra-CCX: ~62-96ns
Cross-CCX: ~230-262ns
Max: 262ns
No spikes above 270ns

For comparison, Intel Xeon 6338N (Ice Lake SP, 32 cores, mesh
interconnect) on the same test shows 51-147ns with no spikes above
150ns across three consecutive runs.

Key observations
================

- Spikes only appear when all 4 CCDs are active simultaneously
- Both cores in a measured pair are actively spinning (CAS loop),
  so this is not a wakeup-from-idle issue
- cpuidle driver is "none", idle uses mwait (C1 equivalent)
- No SMIs detected
- Restricting to a single CCD eliminates all spikes
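
In case anyone wants to apply the same filter to their own latency
matrix, this is roughly how spiking pairs can be separated from the
baseline. It is only a sketch: find_spikes() and the 5x-median cutoff
are arbitrary choices for illustration, and the sample values are
made up except for the (122, 64) maximum reported above.

```python
from statistics import median


def find_spikes(latency_ns, factor=5.0):
    """Return (pair, ns) entries above factor x the median, worst first."""
    if not latency_ns:
        return []
    cutoff = factor * median(latency_ns.values())
    spikes = [(pair, ns) for pair, ns in latency_ns.items() if ns > cutoff]
    return sorted(spikes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative values only; (122, 64) is the maximum reported above.
    sample = {
        (0, 1): 70.0,
        (0, 8): 240.0,
        (0, 16): 260.0,
        (16, 48): 280.0,
        (122, 64): 3056.0,
    }
    for pair, ns in find_spikes(sample):
        print(pair, ns)
```

A median-based cutoff is used here so that the spikes themselves do not
drag the threshold upward the way a mean would.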

Questions
=========

1. Are multi-us latency spikes on cross-CCD atomic operations a known
   characteristic of Infinity Fabric under heavy coherency traffic
   from all CCDs simultaneously?

2. For RT workloads requiring cross-CCD communication, is there any
   kernel-side mitigation (scheduler hints, IF QoS, bandwidth
   partitioning via MBA/L3 CAT) beyond pinning to a single CCD?

3. Could this be related to snoop filter capacity on the IO die
   (given the 16MB L3 per CCX), or is IF arbitration the more likely
   bottleneck?

Not reporting a bug - looking for confirmation that this is expected
hardware behavior and any guidance for multi-CCD RT deployments on
EPYC Siena.

Thanks,
Ionut Nechita