Re: 8aeb879baf12 - significant system call latency regression, bisected

From: H. Peter Anvin

Date: Tue Jun 16 2026 - 13:45:26 EST

On June 16, 2026 2:51:12 AM PDT, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >
>> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
>> > make sense. However, Gemini, or whatever AI sits in google search, is
>> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>>
>> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
>> to not be 64-byte aligned - simply because you may need to fetch more
>> cachelines (assuming fairly linear code).
>>
>> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
>> bytes as three 16-byte fetches.
>>
>> But I don't know if they can do the old "split line access" that older
>> cores could do, where a Pentium would do two 8-byte accesses at the
>> same time, and they didn't have to be in the same cache line.
>>
>> So 64-byte alignment would always be the best option if you only look
>> at a *particular* piece of code.
>>
>> But it obviously is very wasteful and hurts when there is code around
>> it that could be loaded into the cache at the same time.
>>
>> So almost certainly not a good idea in general.
>>
>> But 64-byte alignment is probably what things like interrupt and
>> system call entrypoints should use, because those things would make
>> sense to look at as isolated things, not part of a bigger load. And
>> they are quite likely to start from a fairly cold-cache situation.
>>
>> So *not* some general compiler option in a config file, but maybe a
>> special "entry point alignment" macro?
>
>Yeah, agreed on that approach - but before/while we fix it,
>I'm also still somewhat baffled by the numbers hpa reported:
>
>>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>>> with the above patch reverted it is 397.5±0.4 - this is in fact a 20%
>>>> increase in latency, not 13%...
>
>Now that we know that this regression is caused by entry function
>alignment changes, do we know *why* it causes an 80-cycle
>shift in system call entry performance?
>
>What does the benchmark measure, cache-cold or cache-hot
>execution?
>
>1) Cache-cold performance:
>
>If it is cold-cache performance, does the misaligned case fetch
>one more cold cacheline?
>
>From which cache does it miss? Fetching from the 2-4MB Panther Lake
>L2 shouldn't be 80 cycles, it should be ~17 cycles.
>
>If it's fetching from the 18MB L3 (which I'd say is the norm for
>most workloads), then the L3->L1I latency is around ~55 cycles on
>Panther Lake, with everything included.
>
>It cannot really be DRAM latency, ie. true cache-cold latency,
>as that would be much more severe, in the 400 cycles range even
>with premium DRAM modules - and more like 500 cycles with
>mainstream DRAM modules and layouts. (Unless we are *lucky* with
>alignment and sizing and the alignment regression doesn't trigger
>full DRAM latency.) The on-die DRAM MSC cache's latency should
>be around 300 cycles - that too is too high.
>
>2) Cache-hot performance:
>
>While cache-hot performance is less relevant for system calls
>(which tend to be cache-cold in practice), if the benchmark
>measures cache-hot performance, why is there an 80-cycle shift
>from just a single misaligned symbol?
>
>Ie. the specific and rather stable figure of 80 cycles overhead
>does not seem to match any of the Panther Lake latencies that
>ought to be relevant to this regression, if we use the simplest
>mental model of what's going on when alignment changes.
>
>So it is either some other uarch pathology, triggered by bad
>alignment, or something doesn't add up in my mental model
>of the root cause of this problem. :-)
>
>Side notes:
>
> - The 6 cycles noise in the 478±6 cycles measurement
> does suggest that we might have missed out to a
> deeper cache hierarchy level, versus the rather
> stable 397.5±0.4 pre-regression figure.
>
> - I'm also assuming that 'cycles' here is a frequency-invariant
> standardized constant 5.1 GHz TSC value or so?
>
>Thanks,
>
> Ingo

It's cache hot, calling getppid() in a tight loop. The units are renormalized from TSC cycles to core cycles, using fixed counter 1 to determine the actual ratio.
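
For reference, the post-processing described in this thread (dropping the top and bottom octiles, then rescaling TSC ticks to core cycles) could look roughly like the sketch below. This is not the actual benchmark code: the names octile_trimmed_mean() and tsc_to_core_cycles() are made up for illustration, and the sketch assumes the per-call TSC deltas and the two counter totals (core cycles and TSC ticks over the same interval) have already been collected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort the samples in place, drop the lowest and
 * highest eighth (octile), and return the mean of what remains.
 * Returns 0.0 for an empty input.
 */
static double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;	/* samples dropped at each end */
	double sum = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;

	qsort(samples, n, sizeof(*samples), cmp_double);

	for (i = cut; i < n - cut; i++)
		sum += samples[i];

	return sum / (double)(n - 2 * cut);
}

/*
 * Hypothetical helper: rescale a TSC delta to core cycles, given the
 * core cycle count and the TSC tick count observed over one common
 * interval (assumed to be measured elsewhere).
 */
static double tsc_to_core_cycles(double tsc_delta,
				 double core_cycles_total,
				 double tsc_total)
{
	assert(tsc_total > 0.0);

	return tsc_delta * (core_cycles_total / tsc_total);
}
```

With n samples this discards n/8 (rounded down) from each end, so a handful of interrupted iterations do not skew the average.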