Re: 8aeb879baf12 - significant system call latency regression, bisected

From: H. Peter Anvin

Date: Sun Jun 14 2026 - 14:32:36 EST

On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@xxxxxxxxx> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>>
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>>
>>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>>
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>>
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Futhermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>>
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>>
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>>
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where its hurting?
>>>>
>>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>>
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> > but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> > there isn't anything in my test setup that could introduce any kind of
>>> > "memory" between builds...>
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
>> OK, I have, I believe root-caused this.
>>
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>>
>> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
>
>
>The problem doesn’t happen to IDT?
>
>
>>
>> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
>
>
>Good point, alignment check should be applied to all such entries.
>
>Thanks
> Xin

The problem is that if you put an alignment directive on a function, it aligns the __pfx stub, which is exactly The Wrong Thing™.

Otherwise this would be easy to fix, permanently.

I haven't had time to test IDT yet. I assume it is similar.