Re: 8aeb879baf12 - significant system call latency regression, bisected

From: Linus Torvalds

Date: Thu Jun 18 2026 - 22:09:05 EST

On Thu, 18 Jun 2026 at 18:18, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>
> It *does* indeed align x64_sys_call() now, but the performance is firmly
> in the "bad" bucket, higher variance and all...

Considering that your load should all be hot-cache anyway, maybe the
problem was never the alignment - which, if the load really is in the
cache, shouldn't matter that much anyway, and certainly not 90 cycles
worth.

I bet you're hitting some pipeline exception for some corner case.

Like a store that gets mistakenly taken to cause an instruction cache
flush because the core thinks you're doing self-modifying code.

And then the issue isn't alignment in the I$, it's "I$ line happens to
alias with a D$ write hit" and you just hit an unlucky pattern.
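
To make the aliasing idea concrete, here is a minimal sketch of the kind of
address check one could run over a store target and a code address. The
64-byte line size and 4 KiB page size are assumptions, both helper names are
hypothetical, and the second helper models a partial-address match that a
core might or might not actually use. Real self-modifying-code detection
works on physical addresses and is microarchitecture-specific, so treat this
as a way to spot suspicious layouts, not as a model of any particular CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is typical on current x86 parts. */
#define LINE_SIZE 64u
/* Assumed page size, used only for the partial-address check below. */
#define PAGE_SIZE_ASSUMED 4096u

/* True if a store to 'data' lands in the same cache line as code at 'code'. */
static bool same_cache_line(uintptr_t code, uintptr_t data)
{
    return (code / LINE_SIZE) == (data / LINE_SIZE);
}

/*
 * Hypothetical partial match: true if both addresses select the same
 * line-sized slot within a page (address bits 6..11 with the sizes above),
 * even when they sit in different pages. Whether any real core compares
 * only these bits is an assumption made for illustration.
 */
static bool same_line_offset_in_page(uintptr_t code, uintptr_t data)
{
    uintptr_t mask = (PAGE_SIZE_ASSUMED - 1) & ~(uintptr_t)(LINE_SIZE - 1);
    return (code & mask) == (data & mask);
}
```

Feeding the entry-path code addresses and the addresses of the data written
on that path through both helpers would show whether a "good" and a "bad"
kernel build differ in which pairs collide.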

Perf counters for pipeline flushes? Ask your favorite CPU architect
person, because there's probably tens of different cases of that
"unlucky situation that causes the front-end to flush".
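
One way to look for such flushes is to count a hardware event around a tight
system call loop with perf_event_open(2). The sketch below is only that: the
helper names are hypothetical, getppid is picked as an arbitrary cheap system
call, and the raw event code for a machine clear or front-end flush is
deliberately left to the caller, because it differs per microarchitecture and
has to come from the vendor's event list.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one counter on the calling thread, on any CPU. Returns fd or -1. */
static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;       /* e.g. PERF_TYPE_RAW for a model-specific event */
    attr.config = config;   /* raw event code, supplied by the caller */
    attr.disabled = 1;      /* start stopped, enable explicitly below */
    attr.exclude_hv = 1;
    /*
     * exclude_kernel stays 0 so the system call entry path is counted.
     * That usually needs perf_event_paranoid <= 1 or CAP_PERFMON.
     */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count the event over 'iters' getppid calls. Returns 0 on success, -1 on error. */
static int count_over_syscalls(uint32_t type, uint64_t config,
                               long iters, uint64_t *count)
{
    ssize_t n;
    int fd = open_counter(type, config);

    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Dividing the count by the iteration count on a "good" and a "bad" kernel
would show whether the flush happens roughly once per system call. Running
it first with PERF_TYPE_HARDWARE and PERF_COUNT_HW_INSTRUCTIONS is a quick
check that counting works at all on the machine.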

The FRED system call entry part might actually make some of those
worse, since presumably Intel has learnt a lesson and is flushing
various prediction caches on FRED entry.

But that also means that if it happens once, it happens *every* time,
because it also flushes the prediction data that would keep it from
happening in normal loads.

Linus