Re: 8aeb879baf12 - significant system call latency regression, bisected

From: H. Peter Anvin

Date: Sat Jun 13 2026 - 22:09:07 EST

On 2026-06-13 16:52, H. Peter Anvin wrote:

On 2026-06-13 13:34, H. Peter Anvin wrote:

On 2026-06-13 01:59, Peter Zijlstra wrote:

On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
So I was trying to figure out a significant -- about 13% -- increase
in system call latency between v7.0 and the current master, and it
bisects down to:

8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build

This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
is a bare metal boot, no KVM.

I'm personally extremely puzzled how this could possibly be related,
and I will be investigating the possibility that this is a false
bisect, but it is not a Heisenbug in any way; it has been extremely
reproducible, and the difference is statistically valid by close to 10
sigma. Furthermore, the bisection at least gave the appearance of
stability.

Given how late in the cycle this is I wanted to send an alert sooner
rather than later; I will update as I get more data.

Uhm, massive WTF indeed. I don't immediately see how this could possibly
affect a FRED host either, except perhaps in code layout.

I don't actually have a FRED capable machine, but have you tried running
one of those top-down perf things on it, to see where it's hurting?

Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)

I reverted the patch on top of rc7, and it did, in fact, fix the regression,
but I'm doing a clean from-scratch rebuild of both trees to make sure
there isn't anything in my test setup that could introduce any kind of
"memory" between builds...

Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. This is in fact a 20% increase in latency, not 13%...
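
For anyone who wants to check the arithmetic, here is a minimal sketch of the two calculations involved: a mean with the top and bottom octiles dropped, and the relative increase between the two averages quoted above. The function names are hypothetical and this is not the actual measurement harness, only an illustration of the method as described.

```python
from statistics import mean

def trimmed_mean(samples, fraction=0.125):
    """Mean after dropping the lowest and highest `fraction` of the samples.

    fraction=0.125 drops one octile from each end, as described above.
    """
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)  # number of samples cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

def relative_increase(baseline, regressed):
    """Percentage by which `regressed` exceeds `baseline`."""
    return (regressed - baseline) / baseline * 100.0

# Averages quoted in the message: 397.5 cycles (reverted) vs 478 cycles (7.1rc7).
print(f"{relative_increase(397.5, 478.0):.1f}% increase in latency")
```

The trimming step only protects against outliers such as interrupts landing inside a measurement; it does not by itself produce the ± figures, which would need a separate error estimate.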

OK, I have, I believe, root-caused this.

It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.

Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
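
To illustrate the effect, here is a small sketch of the address arithmetic. It assumes a 16-byte __pfx prefix in front of the function entry, which is an assumption about the build configuration rather than something measured here, and the addresses are made up for the example.

```python
PREFIX_BYTES = 16  # ASSUMED size of the __pfx_ padding in front of the entry

def entry_offset(pfx_address, alignment=32, prefix_bytes=PREFIX_BYTES):
    """Offset of the function entry within its `alignment`-byte block,
    given the address of the __pfx_ symbol that precedes it."""
    return (pfx_address + prefix_bytes) % alignment

def is_aligned(address, alignment=32):
    """True if `address` is a multiple of `alignment`."""
    return address % alignment == 0

# Hypothetical addresses, chosen only to show the two cases.
for pfx in (0x1000, 0x1010):
    entry = pfx + PREFIX_BYTES
    print(f"__pfx at {pfx:#x} (32B aligned: {is_aligned(pfx)}), "
          f"entry at {entry:#x} (32B aligned: {is_aligned(entry)}, "
          f"offset {entry_offset(pfx)})")
```

Under that assumption, whenever the prefix symbol lands on a 32-byte boundary the entry point sits at offset 16 within the block, and vice versa, so the two cannot both be 32-byte aligned.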

I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...

-hpa