Re: pt_regs->ax == -ENOSYS

From: H. Peter Anvin
Date: Tue Apr 27 2021 - 20:24:18 EST

On 4/27/21 5:11 PM, Andy Lutomirski wrote:
On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <hpa@xxxxxxxxx> wrote:

On 4/27/21 4:23 PM, Andy Lutomirski wrote:

I much prefer the model of saying that the bits that make sense for
the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
everything else) are all valid. This way there are no weird reserved
bits, no weird ptrace() interactions, etc. I'm a tiny bit concerned
that this would result in a backwards compatibility issue, but not
very. This would involve changing syscall_get_nr(), but that doesn't
seem so bad. The biggest problem is that seccomp hardcoded syscall
nrs to 32 bit.

An alternative would be to declare that we always truncate to 32 bits,
except that 64-bit SYSCALL with high bits set is an error and results
in ENOSYS. The ptrace interaction there is potentially nasty.

Basically, all choices here kind of suck, and I haven't done a real
analysis of all the issues...

OK, I really don't understand this. The *current* way of doing it causes
a bunch of ugly corner conditions, including in ptrace, which this would
get rid of. It isn't any different than passing any other argument which
is an int -- in fact we have this whole machinery to deal with that subcase.

Let's suppose we decide to truncate the syscall nr. What would the
actual semantics be? Would ptrace see the truncated value in orig_ax?
How about syscall user dispatch? What happens if ptrace writes a
value with high bits set to orig_ax? Do we truncate it again? Or do
we say that ptrace *can't* write too large a value?

For better or for worse, RAX is 64 bits, orig_ax is a 64-bit field, and
it currently has nonsensical semantics. Redefining orig_ax as a
32-bit field is surely possible, but doing so cleanly is not
necessarily any easier than any other approach. If it weren't for
seccomp, I would say that the obviously correct answer is to just
treat it everywhere as a 64-bit number.

We *used* to truncate the system call number; that was unsigned. It causes a massive headache for ptrace if a 32-bit ptrace wants to write -1, which is a bit hacky.

I would personally like orig_ax to be the register exactly as passed in, with the truncation done by syscall_get_nr().
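
A minimal user-space sketch of that split, not kernel code: struct regs_sketch and sketch_syscall_get_nr() are hypothetical stand-ins for pt_regs and syscall_get_nr(). The raw 64-bit value is stored untouched and only the accessor narrows it to a signed int, so a 32-bit tracer's zero-extended -1 and a 64-bit -1 would both read back as -1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of pt_regs. */
struct regs_sketch {
	uint64_t orig_ax;	/* raw register value, stored as passed in */
};

/*
 * Hypothetical accessor: keep the low 32 bits, then reinterpret as signed.
 * The unsigned-to-signed conversion assumes two's-complement wraparound,
 * which is what gcc and clang implement.
 */
static int sketch_syscall_get_nr(const struct regs_sketch *regs)
{
	return (int)(int32_t)(uint32_t)regs->orig_ax;
}

static void sketch_nr_check(void)
{
	/* A 32-bit tracer writing -1 leaves the upper half zero. */
	struct regs_sketch from_32bit_tracer = { .orig_ax = 0x00000000ffffffffULL };
	/* A 64-bit tracer writing -1 sets all 64 bits. */
	struct regs_sketch from_64bit_tracer = { .orig_ax = UINT64_MAX };
	/* High bits set: the accessor ignores them, the field keeps them. */
	struct regs_sketch high_bits_set = { .orig_ax = 0x0000000100000001ULL };

	assert(sketch_syscall_get_nr(&from_32bit_tracer) == -1);
	assert(sketch_syscall_get_nr(&from_64bit_tracer) == -1);
	assert(sketch_syscall_get_nr(&high_bits_set) == 1);
	assert(high_bits_set.orig_ax == 0x0000000100000001ULL);
}
```

In this sketch a tracer reading orig_ax still sees whatever was written, while anything that consumes the number as an int gets one consistent value.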

I also note that kernel/seccomp.c and the tracing infrastructure all expect a signed int as the system call number. Yes, orig_ax is a 64-bit field, but so are the other register fields, which don't necessarily directly reflect the value of an argument -- like, say, %rdi in the case of sys_write: it is an int argument so it gets sign extended; this is *not* reflected in ptrace.
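
The same point for an ordinary int argument, again as a user-space sketch with a hypothetical helper name (sketch_int_arg() is not a kernel function, and this is not how the kernel's syscall wrappers are actually written): the callee sees the low 32 bits sign extended, while the register value a tracer would read is left as it was.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the value a callee would see for an int argument
 * that arrived in a 64-bit register. Assumes two's-complement conversion.
 */
static int64_t sketch_int_arg(uint64_t reg)
{
	int32_t narrowed = (int32_t)(uint32_t)reg;	/* keep the low 32 bits */

	return (int64_t)narrowed;			/* sign extend to 64 bits */
}

static void sketch_arg_check(void)
{
	uint64_t rdi = 0x00000000ffffffffULL;	/* raw value, as a tracer reads it */

	assert(sketch_int_arg(rdi) == -1);		/* callee sees -1 */
	assert(rdi == 0x00000000ffffffffULL);		/* register is unchanged */
	assert(sketch_int_arg(0xdeadbeef00000003ULL) == 3);	/* high bits ignored */
}
```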