Re: 8aeb879baf12 - significant system call latency regression, bisected
From: H. Peter Anvin
Date: Fri Jun 19 2026 - 00:33:38 EST
On 2026-06-18 19:11, Linus Torvalds wrote:
> On Thu, 18 Jun 2026 at 19:08, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Considering that your load should all be hot-cache anyway, maybe the
>> problem was never the alignment - which, if the load really is in the
>> cache, shouldn't matter that much anyway, and certainly not 90 cycles
>> worth.
>
> Ahh, your numbers said 80 cycles, not 90, I mis-remembered.
>
> Still - same thing. 80 or 90 cycles is not a "L2 miss". That really
> smells like a "front end flush" kind of number where you have started
> lots of work and are throwing it all away to restart it.
Well, I'll be darned.
It's not system call dispatch that's at fault, at all. It was a complete red herring. I got misled because of how reproducible it was.
The patch affected the RCU mechanics inside getppid -> __task_pid_nr_ns. I can't think of any reason that would have happened other than alignment, but that shouldn't be an important code path either way.
I have been using getppid() as a standard "trivial system call" for microbenchmarking for so long that it never occurred to me to question it, but it *does* depend on RCU, and it definitely stood out in the profiles.
Switching to __NR__sysctl -> sys_ni_syscall() instead shows that 7.1 is actually slightly better (by a few cycles) than 7.0, which also matches what Tony Luck has seen on IDT. I tested Peter Z's patch as well; it added a few cycles over 7.1 baseline, but I think it can be used as a base for improvements (I spotted a couple of possibilities already, like it calling instrumentation_begin..._end twice.)
Either way, I apologize for the false alarm. Now returning to your regularly scheduled kernel hacking...
-hpa