Re: 8aeb879baf12 - significant system call latency regression, bisected

From: Ingo Molnar

Date: Tue Jun 16 2026 - 05:51:28 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
>
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
>
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
>
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
>
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
>
> So almost certainly not a good idea in general.
>
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load. And
> they are quite likely to start from a fairly cold-cache situation.
>
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it,
I'm also still somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function
alignment changes, do we know *why* it causes an 80-cycle
shift in system call entry performance?
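
For reference, the ~80 cycles and the 20% both follow directly from
hpa's two averages. A minimal sketch of that arithmetic (the variable
names are mine, only the two averages come from the report):

```python
# Averages quoted from hpa's report, in cycles.
regressed = 478.0   # 7.1rc7
reverted = 397.5    # 7.1rc7 with the patch reverted

# Absolute and relative cost of the regression.
delta = regressed - reverted
increase_pct = 100.0 * delta / reverted

print(f"delta = {delta:.1f} cycles, increase = {increase_pct:.1f}%")
```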

What does the benchmark measure, cache-cold or cache-hot
execution?

1) Cache-cold performance:

If it is cold-cache performance, does the misaligned case fetch
one more cold cacheline?
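
To make the "one more cacheline" question concrete, here is a rough
sketch that counts how many 64-byte lines a linear stretch of code
covers for a given start address. The 100-byte length and both
addresses are made-up illustration values, not taken from the actual
entry code:

```python
CACHELINE = 64  # bytes, the line size mentioned earlier in the thread

def lines_touched(start, length, line=CACHELINE):
    """Number of cachelines covered by the bytes [start, start + length)."""
    if length <= 0:
        return 0
    first = start // line
    last = (start + length - 1) // line
    return last - first + 1

# Hypothetical 100-byte linear code path, 64-byte aligned vs. not.
aligned = lines_touched(0x1000, 100)     # starts at offset 0 of a line
misaligned = lines_touched(0x1030, 100)  # starts at offset 48 of a line

print(aligned, misaligned)
```

With these made-up numbers the misaligned start covers one line more
than the aligned one, which is the effect being asked about.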

From which cache does it miss? Fetching from the 2-4MB Panther Lake
L2 shouldn't be 80 cycles, it should be ~17 cycles.

If it's fetching from the 18MB L3 (which I'd say is the norm for
most workloads), then the L3->L1I latency is around 55 cycles on
Panther Lake, with everything included.

It cannot really be DRAM latency, ie. true cache-cold latency,
as that would be much more severe, in the 400 cycles range even
with premium DRAM modules - and more like 500 cycles with
mainstream DRAM modules and layouts. (Unless we are *lucky* with
alignment and sizing and the alignment regression doesn't trigger
full DRAM latency.) The on-die DRAM MSC cache's latency should
be around 300 cycles - that too is too high.

2) Cache-hot performance:

While cache-hot performance is less relevant for system calls
(which tend to be cache-cold in practice), if the benchmark
measures cache-hot performance, why is there an 80-cycle shift
from just a single misaligned symbol?

Ie. the specific and rather stable figure of 80 cycles of overhead
does not seem to match any of the Panther Lake latencies that
ought to be relevant to this regression, if we use the simplest
mental model of what's going on when alignment changes.
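
A quick sketch of that mismatch, under the simplest model where the
regression is exactly one extra miss at a single level. The latency
figures are only the approximate ones quoted in this mail, not
measurements, and the labels are mine:

```python
# Observed shift, from hpa's averages (cycles).
delta = 478.0 - 397.5

# Approximate per-level latencies as quoted in this mail (cycles).
latencies = {
    "L2": 17,
    "L3": 55,
    "MSC": 300,
    "DRAM (premium)": 400,
    "DRAM (mainstream)": 500,
}

# Ratio of the observed shift to one miss at each level; a value
# near 1.0 would mean "one extra miss at this level" explains it.
ratios = {name: delta / lat for name, lat in latencies.items()}

for name, ratio in ratios.items():
    print(f"{name:18s} {latencies[name]:4d} cycles  shift/latency = {ratio:.2f}")
```

None of the ratios comes out close to 1.0, which is the point above:
a single extra miss at any one level doesn't account for the shift.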

So it is either some other uarch pathology, triggered by bad
alignment, or something doesn't add up in my mental model
of the root cause of this problem. :-)

Side notes:

- The 6 cycles of noise in the 478±6 cycles measurement
does suggest that we might be missing out to a
deeper cache hierarchy level, versus the rather
stable 397.5±0.4 pre-regression figure.

- I'm also assuming that 'cycles' here is a frequency-invariant
standardized constant 5.1 GHz TSC value or so?
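
If that assumption holds, converting to wall-clock time is a plain
division. A small sketch, where the 5.1 GHz constant is just my guess
from above and not a confirmed TSC rate for the test machine:

```python
TSC_GHZ = 5.1  # assumed constant TSC rate; cycles per nanosecond

def cycles_to_ns(cycles, ghz=TSC_GHZ):
    """Convert a TSC cycle count to nanoseconds at a constant rate."""
    return cycles / ghz

regressed_ns = cycles_to_ns(478.0)
reverted_ns = cycles_to_ns(397.5)
delta_ns = regressed_ns - reverted_ns

print(f"{reverted_ns:.1f} ns -> {regressed_ns:.1f} ns (+{delta_ns:.1f} ns)")
```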

Thanks,

Ingo