Re: 8aeb879baf12 - significant system call latency regression, bisected
From: H. Peter Anvin
Date: Mon Jun 15 2026 - 14:46:49 EST
On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>> - Since we no longer use the sys_call_table[] as a jump table,
>> do we actually need array_index_nospec()? in do_syscall_x64|32?
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
Well, we could put it under an #ifdef by adding a macro to detect when we use -fno-jump-tables. PeterZ and I have also been talking about making -fno-jump-tables unconditional, because at some point we found that the performance difference was negligible, at least when array_index_nospec() is necessary, and it makes tuning a lot easier when you don't have to deal with code bases that compile both ways. It is not just retpoline but also IBT that disables jump tables (although the comment says "for now"); this of course means that in practice the kernels everyone uses are compiled without jump tables.
The system call dispatch is really the biggest case here.
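For illustration, here is a minimal userspace sketch of the kind of branchless index clamp that array_index_nospec() provides. The mask expression is modeled on the kernel's generic array_index_mask_nospec(); the toy_* handlers and the table are made up for the example, and unlike a real implementation this sketch does not hide the value from the optimizer.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Branchless bounds mask: all ones if index < size, else zero.
 * Relies on arithmetic right shift of negative values (GCC and Clang
 * behavior; implementation-defined in ISO C) and on size <= LONG_MAX.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	const unsigned int shift = sizeof(long) * CHAR_BIT - 1;

	assert(size <= (unsigned long)LONG_MAX);
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >> shift);
}

/* Hypothetical handlers standing in for real system calls. */
static long toy_read(void)  { return 100; }
static long toy_write(void) { return 101; }
static long toy_open(void)  { return 102; }

typedef long (*toy_handler_t)(void);

static const toy_handler_t toy_table[] = { toy_read, toy_write, toy_open };
#define TOY_NR (sizeof(toy_table) / sizeof(toy_table[0]))

static long toy_dispatch(unsigned long nr)
{
	if (nr >= TOY_NR)
		return -ENOSYS;

	/* Even if the branch above is mispredicted, nr collapses to 0 here. */
	nr &= index_mask(nr, TOY_NR);
	return toy_table[nr]();
}
```

The cost is a handful of ALU operations per dispatch, which is why it is cheap to keep wherever the compiler might still emit an indexed indirect jump.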
It does, however, make me think that using regs->ax to dispatch system calls in the FRED path might actually be The Wrong Thing[TM]; FRED delivery is a speculation barrier and so %rax is guaranteed to be stable at that point. *In practice* the stack engine probably would propagate that (I can't really think of any way to implement a stack engine that wouldn't, and I suspect if it didn't we would have lots of other issues), but instead of dumping it into memory and reading it back, it would probably be better to do what the SYSCALL path does and move it into an argument register.
I have been experimenting with micro-optimizations of the FRED path lately, in part because FRED inherently provides speculation guarantees that SYSCALL/SYSRET do not, and in part because some of the code paths have a fair bit of unnecessary overhead in general, some of which affects FRED disproportionately (some of it duplicates work that FRED does inherently, for one thing). So far I have been somewhat surprised how *little* effect some of them have had; clearly branch prediction does a really good job sometimes even without static branches.
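A rough C model of the two data flows, purely as a sketch: every toy_* name below is made up, and the real difference lives in the assembly entry stubs, which plain C can only approximate. Variant A reloads the system call number from the saved frame; variant B receives it as an argument, the way the SYSCALL path passes it to do_syscall_64().

```c
#include <errno.h>

/* Toy stand-in for struct pt_regs; only the field used here. */
struct toy_regs {
	unsigned long ax;
};

/* Hypothetical handler lookup: numbers 0..2 are valid. */
static long toy_handle(int nr)
{
	return (nr >= 0 && nr < 3) ? 100 + nr : -ENOSYS;
}

/* Variant A: the dispatcher re-reads the number from the saved frame. */
static long toy_dispatch_from_frame(struct toy_regs *regs)
{
	return toy_handle((int)regs->ax);
}

/* Variant B: the entry code hands the number over as an argument. */
static long toy_dispatch_from_arg(struct toy_regs *regs, int nr)
{
	(void)regs;	/* the frame is still there for the call arguments */
	return toy_handle(nr);
}

/* Models the entry stub; rax is the live register value at entry. */
static long toy_entry(struct toy_regs *regs, unsigned long rax, int use_arg)
{
	regs->ax = rax;	/* frame save: a store to memory */
	return use_arg ? toy_dispatch_from_arg(regs, (int)rax)
		       : toy_dispatch_from_frame(regs);
}
```

Both variants return the same result; the only difference is whether the dispatch index takes a round trip through memory.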
Still, some pretty simple changes can get a few percent improvement, well above the statistical noise margin.
Doing a *very* early-out and dispatching to do_syscall_64() already in asm_entry_point_user is one of the more effective hacks; I am (or rather was, until I discovered this immediate issue ;) also experimenting with having separate IDT and FRED versions of do_syscall_64() -- the code factors very cleanly and the duplication is nearly all at the object code level.
Part of the reason for my questions to PeterZ was that I believe inlining x64_sys_call() will benefit a fair bit from better code layout. We have talked about sunsetting x32, but until we do, merging x32_sys_call() into the same function also lets the two switch statements share a fair bit of code, since there are large contiguous chunks of x32 system call space which are the same as x64.
One of the things I have been thinking about, too, is to move FRED- and IDT-specific code into separate text sections; not only so that they can be close together in memory, but also so that we can poison out the areas that aren't being used. Every code flow that has almost unlimited versatility is, obviously, an *extremely* desirable target for execution redirection attacks...
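A toy sketch of what a merged switch could look like. The handler names and system call numbers here are invented for illustration and do not match the real tables; only the value of the x32 marker bit mirrors __X32_SYSCALL_BIT.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_X32_BIT 0x40000000u	/* same value as __X32_SYSCALL_BIT */

/* Hypothetical handlers; the return values only identify the handler. */
static long toy_shared_a(void)    { return 1; }
static long toy_shared_b(void)    { return 2; }
static long toy_native_only(void) { return 3; }
static long toy_x32_only(void)    { return 4; }

static long toy_sys_call(unsigned int nr)
{
	const bool x32 = nr & TOY_X32_BIT;

	switch (nr & ~TOY_X32_BIT) {
	/* Shared range: one set of cases serves both ABIs. */
	case 0:   return toy_shared_a();
	case 1:   return toy_shared_b();
	/* Numbers valid in only one ABI still need a check. */
	case 2:   return x32 ? -ENOSYS : toy_native_only();
	case 512: return x32 ? toy_x32_only() : -ENOSYS;
	default:  return -ENOSYS;
	}
}
```

With jump tables disabled, the compiler lowers such a switch to a compare tree, so the shared cases are emitted once instead of twice.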
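As a sketch of the mechanism only: with GCC or Clang targeting ELF, functions can be placed in named text sections with the section attribute. The section names and toy_* functions below are hypothetical, and the poisoning step itself (overwriting the unused section with trapping instructions) is not shown.

```c
#include <stdbool.h>

/* Hypothetical section names; requires GCC or Clang targeting ELF. */
#define TOY_FRED_TEXT __attribute__((noinline, section(".text.toy_fred")))
#define TOY_IDT_TEXT  __attribute__((noinline, section(".text.toy_idt")))

/* Each flavor of entry code lands in its own contiguous section. */
static TOY_FRED_TEXT int toy_fred_entry(void) { return 1; }
static TOY_IDT_TEXT  int toy_idt_entry(void)  { return 2; }

typedef int (*toy_entry_t)(void);

/* Chosen once at boot; the other flavor's section is then dead code. */
static toy_entry_t toy_select_entry(bool have_fred)
{
	return have_fred ? toy_fred_entry : toy_idt_entry;
}
```

Because each flavor is contiguous, a linker script can bracket it with start and end symbols, which is what would make poisoning the unused range practical.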
-hpa