Re: 8aeb879baf12 - significant system call latency regression, bisected

From: Xin Li

Date: Sun Jun 14 2026 - 14:10:01 EST

> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. This is in fact a 20% increase in latency, not 13%...
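
For anyone wanting to sanity-check the arithmetic above, here is a minimal Python sketch. The function names are made up for illustration and are not from any tool used in this thread, and the exact trimming rule (drop the lowest and highest eighth of the sorted samples) is my assumption about what "top and bottom octiles removed" means.

```python
def trimmed_mean(samples):
    """Mean after dropping the lowest and highest eighth of the samples.

    Assumption: "octiles removed" means discarding len(samples) // 8
    values from each end of the sorted list.
    """
    ordered = sorted(samples)
    k = len(ordered) // 8
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def percent_increase(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


# Figures quoted above: 397.5 cycles with the patch reverted, 478 with it.
increase = percent_increase(397.5, 478.0)
print(f"{increase:.1f}% increase in latency")
```

With the quoted averages this comes out at roughly 20%, which matches the corrected figure rather than the original 13%.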
>
> OK, I have, I believe, root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.

So the problem doesn’t happen with IDT?

>
> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...

Good point; an alignment check should be applied to all such entries.
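
A rough sketch of what such a check could look like, assuming input in the usual three-column `nm` format (address, type, name). The helper name, the 32-byte default and the sample addresses below are all hypothetical and chosen only for illustration; they are not taken from a real build.

```python
def misaligned_functions(nm_lines, align=32):
    """Return (name, address, offset) for text symbols off an `align` boundary.

    The __pfx_ padding symbols themselves are skipped, since the question is
    whether the real entry point that follows them ends up aligned.
    """
    result = []
    for line in nm_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that are not "address type name"
        addr_text, kind, name = parts
        if kind not in ("T", "t"):
            continue  # only text (code) symbols
        if name.startswith("__pfx_"):
            continue
        addr = int(addr_text, 16)
        offset = addr % align
        if offset:
            result.append((name, addr, offset))
    return result


# Hypothetical addresses: the prefix symbol is 32-byte aligned, so the
# function placed right after it is not.
sample = [
    "ffffffff81000040 T __pfx_x64_sys_call",
    "ffffffff81000050 T x64_sys_call",
    "ffffffff81000080 T some_aligned_function",
]
for name, addr, offset in misaligned_functions(sample):
    print(f"{name} at {addr:#x} is {offset} bytes past a 32-byte boundary")
```

This only flags candidates; whether a given misaligned entry actually costs cycles would still need to be measured.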

Thanks
Xin