Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
From: H. Peter Anvin
Date: Wed May 12 2021 - 15:37:49 EST
On 5/12/21 1:51 AM, Ingo Molnar wrote:
I've applied patches 1-6, thanks Peter!
Wrt. patch #7 - the changelog is hedging things a bit and the changes are
non-trivial. Does this patch (intend to) change any actual observable
behavior in the system call interface, and if yes, in which areas?
Or is this a pure cleanup with no observable changes expected?
I'll clean up the comments a bit [including per tglx' review.] I'm
writing this email in part to organize my own thoughts in how to better
explain the motivation for the patch, and the extent of visible differences.
Right now, *some* code will treat e.g. 0x0000000100000001 as a system
call and some will not. Some of the code, notably in ptrace, will treat
0x000000018000000 as a system call and some will not. Finally, right
now, e.g. 335 for x86-64 will force the exit code to be set to -ENOSYS
even if poked by ptrace, but 548 will not, because there is an
observable difference between an out of range system call and a system
call number that falls outside the range of the table.
The use of a non-system-call number in a system call can come in in a
few ways:
1. Via ptrace;
2. From seccomp;
3. By explicitly passing %eax = -1 to a system call.
#3 isn't really a problem *unless* it for some reason confuses ptrace or
seccomp users -- we could do an early-out for it.
For ptrace and seccomp, we enter with the return value (regs->ax) set to
-ENOSYS, regardless of if the system call number is valid or not.
ptrace/seccomp has the option of independently emulate a system call,
then set regs->orig_ax to -1 and provide any chosen return value in
regs->ax. In that case, we must *not* update regs->ax to -ENOSYS before
returning.
The arch-independent code all assumes that a system call is "int" that
the value -1 specifically and not just any negative value is used for a
non-system call. This is the case on x86 as well when arch-independent
code is involved. The arch-independent API is defined/documented (but
not *implemented*!) in <asm-generic/syscall.h>, and what this patch is
intended to do is to make the x86-specific code follow.
As everyone either does or should know, treating the same data in
different ways in different flows is a security hole just waiting to
happen.
Most of these are relatively recently introduced bugs in x86-64; the
original assembly code version zero-extended %rax, compared it against
the length of the system call table, and would unconditionally return
-ENOSYS otherwise. Then *at least* on two separate occasions someone
"optimized" it by removing "movl %eax, %eax" not understanding that this
is not a noop in x86-64 but a zero-extend (perhaps gas ought to be able
to allow movzlq as an alias...) introducing a critical security bug
which was one of the motivators for the SMAP CPU feature.
On x86-64, the ABI is that the callee is responsible for extending
parameters, so only examining the lower 32 bits is fully consistent with
any "int" argument to any system call, e.g. regs->di for write(2).
Andy L. and I had a fairly long discussion about this, and some of this
was updated after the first RFC, but I fully agree I'm not capturing it
all that well. I hope the above points clear things up better and I'll
rewrite this into a better patch description.
-hpa