Re: 8aeb879baf12 - significant system call latency regression, bisected

From: Ingo Molnar

Date: Wed Jun 17 2026 - 06:03:35 EST

* H. Peter Anvin <hpa@xxxxxxxxx> wrote:

> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.

Hm, in that light the 80 cycles overhead from a single
misaligned symbol is rather surprising (to me): it's
way too high to be reasonably caused by any hot cache
alignment effects - and all of the regular instruction
caches (or even data caches) should be more than large
enough to fit such a getppid() benchmark fully into the
cache.

Would be nice to see before/after 'perf stat --repeat <N>'
figures, with sufficiently high <N> to get <0.1% stddev.
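
For reference, a minimal sketch of how the <0.1% criterion could be
checked on raw per-run numbers (for example cycles per syscall from
each run). The function names are made up for illustration, and the
'+-' figure that perf stat prints itself may be defined slightly
differently, so treat this as an approximation only:

```python
import statistics


def rel_stddev_percent(samples):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.fmean(samples)
    return 100.0 * statistics.stdev(samples) / mean


def rel_stderr_percent(samples):
    """Standard error of the mean as a percentage of the mean.

    This shrinks with 1/sqrt(N), which is why a higher repeat
    count helps to separate a real regression from noise.
    """
    return rel_stddev_percent(samples) / len(samples) ** 0.5


def is_stable(samples, limit_percent=0.1):
    """True if the relative stddev is below the given limit."""
    return rel_stddev_percent(samples) < limit_percent
```

If the 'before' runs pass is_stable() and the 'after' runs do not,
that by itself would already be an interesting data point.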

And just to guess around a bit, here's the various caches,
buffers and queues on a Panther Lake Performance Core
(Cougar Cove) that may play a role:

- L0 Data Cache (L0D) 48 KB 768 cachelines
- L1 Data Cache (L1D) 192 KB 3,072 cachelines
- L1 Instruction Cache (L1I) 64 KB 1,024 cachelines
- L2 Cache 3,072 KB 49,152 cachelines
- uOP Cache (Micro-op Cache) - ~5,250 uOPs, ~64 sets x 10-12-way
- uOP Queue - 192 entries
- Reorder Buffer (ROB) - 576 entries
- L1 Data TLB (DTLB) - 128 entries
- L2 Shared TLB (STLB) - ~4,096 entries
- Return Stack Buffer (RSB) - 24 entries
- Load Queue - ~114 entries
- Store Queue - ~56 entries

Where all cacheline sizes are 64 bytes, and a uOP cache 'set'
fits up to 6-8 uops.
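
The cacheline counts follow directly from capacity divided by line
size. A small sketch of that arithmetic, assuming 64-byte lines and
1 KB = 1024 bytes (the helper name is made up for illustration):

```python
# Sketch: derive cacheline counts from cache capacity.
# Assumes 64-byte cachelines, as stated in the text.
CACHELINE_BYTES = 64


def cachelines(size_kb, line_bytes=CACHELINE_BYTES):
    """Number of cachelines in a cache of size_kb kilobytes."""
    return size_kb * 1024 // line_bytes


# Capacities in KB, taken from the list in this mail.
CACHES_KB = {"L0D": 48, "L1D": 192, "L1I": 64, "L2": 3072}

if __name__ == "__main__":
    for name, kb in CACHES_KB.items():
        print(f"{name}: {kb} KB -> {cachelines(kb):,} cachelines")
```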

I think with a cache-hot syscall benchmark we can exclude the
largest caches with over 1,000 effective entries with near
certainty as a factor, so what is left are:

- L0 Data Cache (L0D) 48 KB 768 cachelines
- uOP Cache (Micro-op Cache) - ~5,250 uOPs, ~64 sets x 10-12-way
- uOP Queue - 192 entries
- Reorder Buffer (ROB) - 576 entries
- L1 Data TLB (DTLB) - 128 entries
- Return Stack Buffer (RSB) - 24 entries
- Load Queue - ~114 entries
- Store Queue - ~56 entries

I'd exclude the L0D, L1 DTLB, the RSB and the load/store queues
as well, because code alignment of a single symbol should have
a minimal effect on them, which leaves:

- uOP Queue - 192 entries
- uOP Cache (Micro-op Cache) - ~5,250 uOPs, ~64 sets x 10-12-way
- Reorder Buffer (ROB) - 576 entries

And I think of these the main suspect would be the uOP cache,
because its (estimated...) ~10-12 deep associativity limit
of uop-sets may be something this benchmark is hitting on
Panther Lake?

Could it be that the extra alignment adds +1 to the maximum number
of uOP cache 'ways' this execution hits in the uOP cache, moving
it from say 12 (still fits) to 13 (misses), so that this particular
uOP cache set exceeds its associativity and starts thrashing? But
I'm really just guessing wildly here...
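
To illustrate the kind of cliff I mean, here is a toy model of a
single cache set. It assumes strict LRU replacement and a loop that
touches the same lines in the same order on every iteration. Both
are assumptions made for the sketch: the real uOP cache indexing and
replacement policy are not something I know, and a pseudo-LRU policy
would degrade less abruptly.

```python
from collections import OrderedDict


def steady_state_misses(ways, lines_in_set, iterations=100):
    """Misses in the last loop iteration for one LRU-managed set.

    Models a loop touching `lines_in_set` distinct lines that all
    map to the same set of a `ways`-way cache, in the same order
    on every iteration.
    """
    lru = OrderedDict()  # resident line tags, least recently used first
    misses_last_iter = 0
    for _ in range(iterations):
        misses_last_iter = 0
        for tag in range(lines_in_set):
            if tag in lru:
                lru.move_to_end(tag)  # hit: mark as most recently used
            else:
                misses_last_iter += 1  # miss: fill, evicting LRU if full
                if len(lru) >= ways:
                    lru.popitem(last=False)
                lru[tag] = True
    return misses_last_iter
```

With ways=12, twelve lines settle at zero misses per iteration after
warm-up, while thirteen lines miss on every single access, because
each fill evicts exactly the line that is needed next. So a +1 in
the per-set footprint can turn "always hits" into "always misses"
in this model.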

( The extra statistical noise of the regressed figures does suggest
some sort of thrashing mechanism behind the scenes though, and the
regular caches seem large enough to not actually thrash for such
a cache-hot benchmark. )

Or am I missing something obvious?

Any perf stat uOP-related counter measurements might be illuminating.

Thanks,

Ingo