Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()

From: Christophe Leroy
Date: Thu Jan 23 2025 - 17:20:11 EST




Le 23/01/2025 à 19:28, Dmitry V. Levin a écrit :
On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
Bring syscall_set_return_value() in sync with syscall_get_error(),
and let upcoming ptrace/set_syscall_info selftest pass on powerpc.

This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
syscall_set_return_value()").

There is a clear detailed explanation in that commit of why it needs to
be done.

If you think that commit is wrong you have to explain why with at least
the same level of details.

OK, please have a look whether this explanation is clear and detailed enough:

=======
powerpc: properly negate error in syscall_set_return_value()

When syscall_set_return_value() is used to set an error code, the caller
specifies it as a negative value in -ERRORCODE form.

In !trap_is_scv case the error code is traditionally stored as follows:
gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
Here are a few examples to illustrate this convention. The first one
is from syscall_get_error():
/*
* If the system call failed,
* regs->gpr[3] contains a positive ERRORCODE.
*/
return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;

The second example is from regs_return_value():
if (is_syscall_success(regs))
return regs->gpr[3];
else
return -regs->gpr[3];

The third example is from check_syscall_restart():
regs->result = -EINTR;
regs->gpr[3] = EINTR;
regs->ccr |= 0x10000000;

Compared with these examples, the failure of syscall_set_return_value()
to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
/*
* In the general case it's not obvious that we must deal with
* CCR here, as the syscall exit path will also do that for us.
* However there are some places, eg. the signal code, which
* check ccr to decide if the value in r3 is actually an error.
*/
if (error) {
regs->ccr |= 0x10000000L;
regs->gpr[3] = error;
} else {
regs->ccr &= ~0x10000000L;
regs->gpr[3] = val;
}

This fix brings syscall_set_return_value() in sync with syscall_get_error()
and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.

Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
=======

I think there is still something going wrong.

do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.

Then it calls __secure_computing() which returns what __seccomp_filter()
returns.

In case of error, __seccomp_filter() calls syscall_set_return_value()
with a negative value then returns -1

do_seccomp() is called by do_syscall_trace_enter() which returns -1 when
do_seccomp() doesn't return 0.

do_syscall_trace_enter() is called by system_call_exception() and
returns -1, so syscall_exception() returns regs->gpr[3]

In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
called with the return of syscall_exception() as first parameter, which
leads to:

if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
r3 = -r3;
regs->ccr |= 0x10000000; /* Set SO bit in CR */
}
}

By chance, because you have already changed the sign of gpr[3], the
above test fails and nothing is done to r3, and because you have also
already set regs->ccr it works.

But all this looks inconsistent with the fact that do_seccomp sets
-ENOSYS as default value

Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the
syscall number and when it is wrong it goes to skip: which sets
regs->gpr[3] = -ENOSYS;

So really I think it is not in line with your changes to set positive
value in gpr[3].

Maybe your change is still correct but it needs to be handled completely
in that case.

Indeed, there is an inconsistency in !trap_is_scv case.

In some places such as syscall_get_error() and regs_return_value() the
semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
and ccr has 0x10000000 flag set. This semantics is a part of the ABI and
therefore cannot be changed.

In some other places like do_seccomp() and do_syscall_trace_enter() the
semantics is similar to the trap_is_scv case: gpr[3] contains a negative
ERRORCODE and ccr is unchanged. In addition, system_call_exception()
returns the system call function return value when it is executed, and
gpr[3] otherwise. The value returned by system_call_exception() is passed
on to syscall_exit_prepare() which performs the conversion you mentioned.

What's remarkable is that in those places that are a part of the ABI the
traditional semantics is kept, while in other places the implementation
follows the trap_is_scv-like semantics, while traditional semantics is
also supported there.

The only case where I see some intersection is do_seccomp() where the
tracer would be able to see -ENOSYS in gpr[3]. However, the seccomp stop
is not the place where the tracer *reads* the system call exit status,
so whatever was written in gpr[3] before __secure_computing() is not
really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
this patch applied as well as without it.

After looking at system_call_exception() I doubt this inconsistency can be
easily avoided, so I don't see how this patch could be enhanced further,
and what else could I do with the patch besides dropping it and letting
!trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
would be unfortunate.



To add a bit more to the confusion, a task can be flagged with TIF_NOERROR by calling force_successful_syscall_return(), in which case even if gpr[3] contains a negative between -MAX_ERRNO and -1 the syscall will be handled as successfull hence CCR[SO] won't be set. But it seems this is not handled by syscall_set_return_value(). So what will happen with time() when approaching year 2036 for instance ?