Re: rseq with syscall as the last instruction

From: Peter Zijlstra
Date: Thu Sep 30 2021 - 10:02:58 EST


On Tue, Sep 28, 2021 at 11:09:24AM +0200, Dmitry Vyukov wrote:
> Hi rseq maintainers,
>
> I wonder if rseq can be used in the following scenario (or extended to be used).
> I want to pass extra arguments to syscalls using a kind of
> side-channel, for example, to say "do fault injection for the next
> system call", or "trace the next system call". But what counts as the
> "next" system call should be atomic with respect to signals.
> Let's say there is shared per-task memory location known to the kernel
> where these arguments can be stored:
>
> __thread struct trace_descriptor desk;
> prctl(REGISTER_PER_TASK_TRACE_DESCRIPTOR, &desk);
>
> then before a system call I can set up the descriptor to enable tracing:
>
> desk = ...
> SYSCALL;
>
> The problem is that if a signal arrives between setting up desk and the
> SYSCALL instruction, we will actually trace some unrelated syscall in
> the signal handler.
> Potentially the kernel could switch/restore 'desk' around syscall
> delivery, but it becomes tricky/impossible for signal handlers that do
> longjmp or mess with PC in other ways; and also would require
> extending ucontext to include the descriptor information (not sure if it's
> feasible).
>
> So instead the idea is to protect this sequence with rseq that will be
> restarted on signal delivery:
>
> enter rseq critical section with end right after SYSCALL instruction;
> desk = ...
> SYSCALL;
>
> Then the kernel can simply clear 'desk' on syscall delivery.
>
> rseq docs seem to suggest that this can work:
>
> https://lwn.net/Articles/774098/
> +Restartable sequences are atomic with respect to preemption (making it
> +atomic with respect to other threads running on the same CPU), as well
> +as signal delivery (user-space execution contexts nested over the same
> +thread). They either complete atomically with respect to preemption on
> +the current CPU and signal delivery, or they are aborted.
>
> But the doc also says that the sequence must not do syscalls:
>
> +Restartable sequences must not perform system calls. Doing so may result
> +in termination of the process by a segmentation fault.
>
> The question is:
> Can this restriction be weakened to allow syscalls as the last instruction?
> For flags in this case we would pass
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT and
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE, but no
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL.
>
> I don't see any fundamental reasons why this couldn't work b/c if we
> restart only on signals, then once we reach the syscall, rseq critical
> section is committed, right?
>
> Do you have any feeling of how hard it would be to support or if there
> can be some implementation issues?

IIRC the only enforcement of this constraint is rseq_syscall() (which is
a NOP when !CONFIG_DEBUG_RSEQ, because performance).

However, since we use regs->ip, which for SYSCALL points to right
*after* the SYSCALL instruction (for obvious reasons), it will not in
fact match in_rseq_cs().
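
To make the boundary condition concrete, here is a minimal user-space
sketch of the range check, modeled on what in_rseq_cs() does. The struct
cs_range type, the in_cs() and ip_after_syscall() helpers, and the 2-byte
x86-64 SYSCALL length are assumptions for illustration; this is not the
kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct rseq_cs: only the fields the check needs. */
struct cs_range {
	uint64_t start_ip;           /* first instruction of the critical section */
	uint64_t post_commit_offset; /* section length; start_ip + this is outside */
};

/*
 * Half-open range test [start_ip, start_ip + post_commit_offset).
 * The subtraction is unsigned, so an ip below start_ip wraps to a huge
 * value and is rejected as well.
 */
static bool in_cs(uint64_t ip, const struct cs_range *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Hypothetical helper: the value regs->ip holds once a SYSCALL located
 * at syscall_addr has entered the kernel. Assumes the 2-byte x86-64
 * encoding (0f 05).
 */
static uint64_t ip_after_syscall(uint64_t syscall_addr)
{
	return syscall_addr + 2;
}
```

With a section starting at 0x1000 whose last instruction is a SYSCALL at
0x1010, post_commit_offset is 0x12. Then in_cs(0x1010, ...) is true, but
in_cs(ip_after_syscall(0x1010), ...) is false, because 0x1012 - 0x1000 is
equal to, not less than, post_commit_offset.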

And as such, I think your scheme should just work as is. Did you try?