Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs

From: H. Peter Anvin
Date: Wed May 12 2021 - 15:37:49 EST


On 5/12/21 1:51 AM, Ingo Molnar wrote:

I've applied patches 1-6, thanks Peter!

Wrt. patch #7 - the changelog is hedging things a bit and the changes are
non-trivial. Does this patch (intend to) change any actual observable
behavior in the system call interface, and if yes, in which areas?

Or is this a pure cleanup with no observable changes expected?


I'll clean up the comments a bit [including per tglx' review.] I'm writing this email in part to organize my own thoughts in how to better explain the motivation for the patch, and the extent of visible differences.

Right now, *some* code will treat e.g. 0x0000000100000001 as a system call and some will not. Some of the code, notably in ptrace, will treat 0x000000018000000 as a system call and some will not. Finally, right now, e.g. 335 for x86-64 will force the exit code to be set to -ENOSYS even if poked by ptrace, but 548 will not, because there is an observable difference between an out of range system call and a system call number that falls outside the range of the table.

The use of a non-system-call number in a system call can come in in a few ways:

1. Via ptrace;
2. From seccomp;
3. By explicitly passing %eax = -1 to a system call.

#3 isn't really a problem *unless* it for some reason confuses ptrace or seccomp users -- we could do an early-out for it.

For ptrace and seccomp, we enter with the return value (regs->ax) set to -ENOSYS, regardless of if the system call number is valid or not. ptrace/seccomp has the option of independently emulate a system call, then set regs->orig_ax to -1 and provide any chosen return value in regs->ax. In that case, we must *not* update regs->ax to -ENOSYS before returning.

The arch-independent code all assumes that a system call is "int" that the value -1 specifically and not just any negative value is used for a non-system call. This is the case on x86 as well when arch-independent code is involved. The arch-independent API is defined/documented (but not *implemented*!) in <asm-generic/syscall.h>, and what this patch is intended to do is to make the x86-specific code follow.

As everyone either does or should know, treating the same data in different ways in different flows is a security hole just waiting to happen.

Most of these are relatively recently introduced bugs in x86-64; the original assembly code version zero-extended %rax, compared it against the length of the system call table, and would unconditionally return -ENOSYS otherwise. Then *at least* on two separate occasions someone "optimized" it by removing "movl %eax, %eax" not understanding that this is not a noop in x86-64 but a zero-extend (perhaps gas ought to be able to allow movzlq as an alias...) introducing a critical security bug which was one of the motivators for the SMAP CPU feature.

On x86-64, the ABI is that the callee is responsible for extending parameters, so only examining the lower 32 bits is fully consistent with any "int" argument to any system call, e.g. regs->di for write(2).

Andy L. and I had a fairly long discussion about this, and some of this was updated after the first RFC, but I fully agree I'm not capturing it all that well. I hope the above points clear things up better and I'll rewrite this into a better patch description.

-hpa