Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

From: Andrei Vagin
Date: Wed Apr 14 2021 - 18:12:54 EST


On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@xxxxxxxxx> wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...

We definitely can invent more specific interfaces for each of these
problems. Sure, they will handle their use-cases a bit better than this
generic one. But do we want to have two very specific interfaces with
separate kernel implementations? In my experience, the kernel community
doesn't like interfaces that are specific to only one narrow use-case.

So when I was working on process_vm_exec, I was thinking about how to
make one interface that would be good enough for all these use-cases.

>
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
>
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
>
> 1. mm_struct-based:
> a set of syscalls to create a new mm_struct,
> change memory mappings under that mm_struct, and switch to it
> 2. pagetable-mirroring-based:
> like /dev/kvm, an API to create a new pagetable, mirror parts of
> the mm_struct's pagetables over into it with modified permissions
> (like KVM_SET_USER_MEMORY_REGION),
> and run code under that context.
> page fault handling would first handle the fault against mm->pgd
> as normal, then mirror the PTE over into the secondary pagetables.
> invalidation could be handled with MMU notifiers.

We are ready to discuss this sort of interface if the community would
agree to accept it. Are there any other users besides sandboxes that
would need something like this? Will the sandbox use-case be enough to
justify the addition of this interface?

>
> > Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> > process properties can be received only from the process itself. Right
> > now, we use a parasite code that is injected into the process. We do
> > this with ptrace but it is slow, unsafe, and tricky.
>
> But this API will only let you run code under the *mm* of the target
> process, not fully in the context of a target *task*, right? So you
> still won't be able to use this for accessing anything other than
> memory? That doesn't seem very generically useful to me.

You are right, this will not rid us of the need to run parasite code.
What I wrote is that it will make the process of injecting parasite
code a bit simpler.

>
> Also, I don't doubt that anything involving ptrace is kinda tricky,
> but it would be nice to have some more detail on what exactly makes
> this slow, unsafe and tricky. Are there API additions for ptrace that
> would make this work better? I imagine you're thinking of things like
> an API for injecting a syscall into the target process without having
> to first somehow find an existing SYSCALL instruction in the target
> process?


You describe the first problem correctly. We need to find or inject a
syscall instruction in the target process.
Right now, we need to do these steps to execute a system call:

* inject the syscall instruction (PTRACE_PEEKDATA/PTRACE_POKEDATA)
* get the original registers
* set new registers
* get the signal mask
* block signals
* resume the process
* stop it on the next syscall-exit
* get registers
* restore the original registers
* restore the signal mask
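
To make the steps above concrete, here is a minimal sketch of them for
x86-64 Linux. It is an illustration, not CRIU's actual code: the helper
names (run_to_syscall_stop, inject_syscall, demo_remote_getpid) are made
up for this example, only argument-less syscalls are handled, and the
registers are read first because rip is needed to pick the injection
address. The demo forks a traced child and runs getpid() inside it.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Resume the tracee and wait for its next syscall-enter or syscall-exit stop. */
int run_to_syscall_stop(pid_t pid)
{
    int status;

    if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1)
        return -1;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;
    return 0;
}

/*
 * Execute the argument-less syscall `nr` inside a tracee that is already in
 * a ptrace stop (x86-64 only). Returns 0 and stores the result in *ret.
 */
int inject_syscall(pid_t pid, long nr, long *ret)
{
    struct user_regs_struct orig, regs;
    uint64_t orig_mask, block_all = ~0ULL; /* kernel sigset is 8 bytes here */
    long orig_word;
    int rc = -1;

    /* Get the original registers; rip tells us where to inject. */
    if (ptrace(PTRACE_GETREGS, pid, NULL, &orig) == -1)
        return -1;

    /* Inject the two-byte `syscall` instruction (0f 05) at rip. */
    errno = 0;
    orig_word = ptrace(PTRACE_PEEKDATA, pid, (void *)orig.rip, NULL);
    if (errno != 0)
        return -1;
    if (ptrace(PTRACE_POKEDATA, pid, (void *)orig.rip,
               (void *)((orig_word & ~0xffffL) | 0x050fL)) == -1)
        return -1;

    /* From here until the restore below, the tracee is inconsistent. */
    regs = orig;
    regs.rax = nr;
    regs.orig_rax = (unsigned long long)-1; /* avoid syscall-restart fixups */

    /* Save the signal mask, block signals, run the syscall, read the result. */
    if (ptrace(PTRACE_GETSIGMASK, pid, (void *)sizeof(orig_mask), &orig_mask) == 0 &&
        ptrace(PTRACE_SETSIGMASK, pid, (void *)sizeof(block_all), &block_all) == 0) {
        if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) == 0 &&
            run_to_syscall_stop(pid) == 0 &&  /* syscall-enter stop */
            run_to_syscall_stop(pid) == 0 &&  /* syscall-exit stop */
            ptrace(PTRACE_GETREGS, pid, NULL, &regs) == 0) {
            *ret = (long)regs.rax;
            rc = 0;
        }
        ptrace(PTRACE_SETSIGMASK, pid, (void *)sizeof(orig_mask), &orig_mask);
    }

    /* Restore the original registers and the overwritten code. */
    if (ptrace(PTRACE_SETREGS, pid, NULL, &orig) == -1 ||
        ptrace(PTRACE_POKEDATA, pid, (void *)orig.rip, (void *)orig_word) == -1)
        rc = -1;
    return rc;
}

/* Demo: returns 1 if getpid() injected into a child returns the child's PID. */
int demo_remote_getpid(void)
{
    int status, ok = 0;
    long ret = -1;
    pid_t child = fork();

    if (child < 0)
        return 0;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);               /* hand control to the tracer */
        _exit(0);
    }
    if (waitpid(child, &status, 0) == child && WIFSTOPPED(status))
        ok = inject_syscall(child, SYS_getpid, &ret) == 0 && ret == child;
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return ok;
}
```

Every ptrace() call between the POKEDATA and the final restore is a
separate round trip, and if the tracer dies anywhere in that span the
tracee is left with patched code and foreign registers.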

One of the CRIU principles is to avoid changing a process state, so if
criu is interrupted, processes must be resumed and continue running. The
procedure of injecting a system call creates a window when a process is
in an inconsistent state, and if CRIU disappears at such a moment, it
will be fatal for the process. We don't think that we can eliminate such
windows, but we want to make them smaller.

In CRIU, we have a self-healing parasite. The idea is to inject parasite
code with a signal frame that contains the original process state. The
parasite runs in an "RPC daemon mode" and gets commands from criu via a
unix socket. If it detects that criu disappeared, it calls rt_sigreturn
and resumes the original process.
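
The control flow of such a daemon can be sketched as below. This is only
an analogy, not CRIU's protocol: the command code, the marker exit
status, and the function names are invented for the example, and where a
real parasite would call rt_sigreturn to reload the saved state, the
sketch simply exits with a marker value. What it does show is the
detection step: the daemon notices that the controller is gone when
read() on the unix socket reports EOF.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { CMD_PING = 1 };       /* hypothetical command code */
enum { EXIT_RESUMED = 42 };  /* stand-in for "original state restored" */

/* Serve commands until the controller's end of the socket goes away. */
void parasite_daemon(int sock)
{
    uint32_t cmd;

    for (;;) {
        ssize_t n = read(sock, &cmd, sizeof(cmd));
        if (n != (ssize_t)sizeof(cmd))
            break;                    /* EOF or error: controller is gone */
        /* Echo the command back as an acknowledgement. */
        if (write(sock, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
            break;
    }
    /* A real parasite would call rt_sigreturn here. */
    _exit(EXIT_RESUMED);
}

/* Demo: returns the daemon's exit status after the controller vanishes. */
int demo_controller_disappears(void)
{
    int sv[2], status;
    uint32_t cmd = CMD_PING, ack = 0;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == -1)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(sv[0]);
        parasite_daemon(sv[1]);       /* never returns */
    }
    close(sv[1]);

    /* One command round trip, then the controller "disappears". */
    if (write(sv[0], &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd) ||
        read(sv[0], &ack, sizeof(ack)) != (ssize_t)sizeof(ack) ||
        ack != cmd) {
        close(sv[0]);
        waitpid(pid, &status, 0);
        return -1;
    }
    close(sv[0]);

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```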

As for the performance of ptrace, there are a few reasons why it is
slow. First, there are a number of steps that we need to do. Second,
there are two synchronous context switches. Even if we solve the first
problem with a new ptrace command, it will not be enough to stop using a
parasite in CRIU.

>
> > process_vm_exec can
> > simplify the process of injecting a parasite code and it will allow
> > pre-dump memory without stopping processes. The pre-dump here is when we
> > enable a memory tracker and dump the memory while a process continues
> > running. On each iteration we dump memory that has been changed from
> > the previous iteration. In the final step, we will stop processes and
> > dump their full state. Right now the most effective way to dump process
> > memory is to create a set of pipes and splice memory into these pipes
> > from the parasite code. With process_vm_exec, we will be able to call
> > vmsplice directly. It means that we will not need to stop a process to
> > inject the parasite code.
>
> Alternatively you could add splice support to /proc/$pid/mem or add a
> syscall similar to process_vm_readv() that splices into a pipe, right?

We sent patches to introduce process_vm_splice:
https://lore.kernel.org/patchwork/cover/871116/

but they were not merged, and the main reason was the lack of enough
users to justify the addition.
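
For reference, the pipe-based dumping mentioned in the quoted text relies
on vmsplice(), which can only export the caller's own memory; that is
why it has to run from code inside the target. A minimal sketch of the
mechanism follows, with made-up helper names, operating on the calling
process itself and reading the data straight back out of the pipe
instead of handing the pipe to a dumper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Push `len` bytes starting at `buf` into a new pipe with vmsplice(), then
 * drain the pipe into `out`. Returns the number of bytes recovered, or -1.
 * `len` must fit in the pipe (64 KiB by default) or vmsplice() would block.
 */
ssize_t dump_region_via_pipe(void *buf, size_t len, void *out)
{
    int p[2];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t got = -1;

    if (pipe(p) == -1)
        return -1;
    /* Attach the user pages to the pipe. */
    if (vmsplice(p[1], &iov, 1, 0) == (ssize_t)len)
        got = read(p[0], out, len);
    close(p[0]);
    close(p[1]);
    return got;
}

/* Demo: returns 1 if one page survives the round trip through the pipe. */
int demo_dump_page(void)
{
    static char page[4096];
    char copy[4096];

    memset(page, 0xAB, sizeof(page));
    if (dump_region_via_pipe(page, sizeof(page), copy) != (ssize_t)sizeof(page))
        return 0;
    return memcmp(page, copy, sizeof(page)) == 0;
}
```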