Re: For review: seccomp_user_notif(2) manual page [v2]

From: Jann Horn
Date: Thu Oct 29 2020 - 03:41:19 EST


On Thu, Oct 29, 2020 at 3:04 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> On Thu, Oct 29, 2020 at 02:42:58AM +0100, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> > <mtk.manpages@xxxxxxxxx> wrote:
> > > static bool
> > > getTargetPathname(struct seccomp_notif *req, int notifyFd,
> > > char *path, size_t len)
> > > {
> > > char procMemPath[PATH_MAX];
> > >
> > > snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
> > >
> > > int procMemFd = open(procMemPath, O_RDONLY);
> > > if (procMemFd == -1)
> > > errExit("\tS: open");
> > >
> > > /* Check that the process whose info we are accessing is still alive.
> > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > > in checkNotificationIdIsValid()) succeeds, we know that the
> > > /proc/PID/mem file descriptor that we opened corresponds to the
> > > process for which we received a notification. If that process
> > > subsequently terminates, then read() on that file descriptor
> > > will return 0 (EOF). */
> > >
> > > checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > > /* Read bytes at the location containing the pathname argument
> > > (i.e., the first argument) of the mkdir(2) call */
> > >
> > > ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> > > if (nread == -1)
> > > errExit("pread");
> >
> > As discussed at
> > <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@xxxxxxxxxxxxxx>,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, uuuuh... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
>
> By freeze here you mean a killable wait instead of an interruptible
> wait, right?

Nope, nonkillable.

Consider the case of vfork(), where a target process does something like this:

void spawn_executable(char **argv, char **envv) {
pid_t child = vfork();
if (child == 0) {
char path[1000];
sprintf(path, ...);
execve(path, argv, envv);
}
}

and the seccomp notifier wants to look at the execve() path (as a
somewhat silly example). The child process is just borrowing the
parent's stack, and as soon as the child either gets far enough into
execve() or dies, the parent continues using that stack. So keeping
the child in killable sleep would not be enough to prevent reuse of
the child's stack.


But conceptually that's not really a big problem - we already have a
way to force the target task to stay inside the seccomp code no matter
if it gets SIGKILLed or whatever, and that is to take the notify_lock.
When the target task wakes up and wants to continue executing, it has
to first get through mutex_lock(&match->notify_lock) - and that will
always block until the lock is free. So we could e.g. do something
like:

- Grab references to the source pages in the supervisor's address
space with get_user_pages_fast().
- Take mmap_sem on the target.
- Grab references to pages in the relevant range with pin_user_pages_remote().
- Drop the mmap_sem.
- Take the notify_lock.
- Recheck whether the notification with the right ID is still there.
- Copy data from the pinned source pages to the pinned target pages.
- Drop the notify_lock.
- Drop the page references.

and this way we would still guarantee that the target process would
only be blocked in noninterruptible sleep for a small amount of time
(and would not be indirectly blocked on sleeping operations through
the mutex). It'd be pretty straightforward, I think. But as long as we
don't actually need it, it might be easier to just note in the manpage
that this is not currently supported.