Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
From: Tycho Andersen
Date: Thu Sep 27 2018 - 19:04:18 EST
On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> +Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
> interface (search for "seccomp_notify_poll" and
> "seccomp_notify_release" in the patch)
>
> @Tycho: FYI, I've gone through all of v7 now, apart from the
> test/sample code. So don't wait for more comments from me before
> sending out v8.
(assuming you meant v8 -> v9) Yes, thanks for your reviews! Much
appreciated.
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@xxxxxxxx> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
>
> Note that in that case, the trusted runtime needs to be in the same
> mount namespace as the container. mount() doesn't work on the mount
> structure of a foreign mount namespace; check_mnt() specifically
> checks for this case, and I think pretty much everything in
> sys_mount() uses that check. So you'd have to join the container's
> mount namespace before forwarding a mount syscall.
Yep, Serge came up with a pretty neat trick that we used in LXD to
accomplish sending mounts to containers, but it requires some
coordination up front.
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here,
>
> Actually, it doesn't, right? It would apply if you told the kernel "go
> ahead, that syscall is fine", but that's not how the API works - you
> always intercept the syscall, copy argument data to a trusted tracer,
> and then the tracer can make a replacement syscall. Sounds fine to me.
Right, I guess the point here is just "you need to copy all the data
to the tracer *before* making a policy decision".
> > but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> [...]
> > +which (on success) will return a listener fd for the filter, which can then be
> > +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> > +acquired via:
> > +
> > +.. code-block::
> > +
> > + fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
>
> The manpage documents ptrace() as taking four arguments, not three. I
> know that the header defines it with varargs, but it would probably be
> more useful to require passing in zero as the fourth argument so that
> we have a place to stick flags if necessary in the future.
Yep, I'll fix this, thanks. But also this documentation should really
live in the seccomp patch; some rebase got screwed up somewhere.
> > +which grabs the 0th filter for some task which the tracer has privilege over.
> > +Note that filter fds correspond to a particular filter, and not a particular
> > +task. So if this task then forks, notifications from both tasks will appear on
> > +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> > +so a filter fd can safely have many readers.
>
> Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
> clarify in which direction "nth filter" counts.
Will do.
> > +The interface for a seccomp notification fd consists of two structures:
> > +
> > +.. code-block::
> > +
> > + struct seccomp_notif {
> > + __u16 len;
> > + __u64 id;
> > + pid_t pid;
> > + __u8 signalled;
> > + struct seccomp_data data;
> > + };
> > +
> > + struct seccomp_notif_resp {
> > + __u16 len;
> > + __u64 id;
> > + __s32 error;
> > + __s64 val;
> > + };
> > +
> > +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)`` (or ``poll()``) on a seccomp
> > +notification fd to receive a ``struct seccomp_notif``, which contains five
> > +members: the input length of the structure, a unique-per-filter ``id``, the
> > +``pid`` of the task which triggered this request (which may be 0 if the task is
> > +in a pid ns not visible from the listener's pid namespace), a flag representing
> > +whether or not the notification is a result of a non-fatal signal, and the
> > +``data`` passed to seccomp. Userspace can then make a decision based on this
> > +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> > +indicating what should be returned to userspace. The ``id`` member of ``struct
> > +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> > +
> > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > +arguments to the syscall, but does not contain pointers to memory. The task's
> > memory is accessible to suitably privileged tracers via ``ptrace()`` or
> > +``/proc/pid/map_files/``.
>
> You probably don't actually want to use /proc/pid/map_files here; you
> can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> And while reading memory via ptrace() is possible, the interface is
> really ugly (e.g. you can only read data in 4-byte chunks), and your
> caveat about locking out other ptracers (or getting locked out by
> them) applies. I'm not even sure if you could read memory via ptrace
> while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> requires the target to be in a __TASK_TRACED state.
> The two interfaces you might want to use instead are /proc/$pid/mem
> and process_vm_{readv,writev}, which allow you to do nice,
> arbitrarily-sized, vectored IO on the memory of another process.
Yes, in fact the sample code does use /proc/$pid/mem, but the docs
should be correct :)
> > + /*
> > + * Here it's possible we got a signal and then had to wait on the mutex
> > + * while the reply was sent, so let's be sure there wasn't a response
> > + * in the meantime.
> > + */
> > + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > + /*
> > + * We got a signal. Let's tell userspace about it (potentially
> > + * again, if we had already notified them about the first one).
> > + */
> > + n.signaled = true;
> > + if (n.state == SECCOMP_NOTIFY_SENT) {
> > + n.state = SECCOMP_NOTIFY_INIT;
> > + up(&match->notif->request);
> > + }
>
> Do you need another wake_up_poll() here?
Yes! Good point.
> > + mutex_unlock(&match->notify_lock);
> > + err = wait_for_completion_killable(&n.ready);
> > + mutex_lock(&match->notify_lock);
> > + if (err < 0)
> > + goto remove_list;
>
> Add a comment here explaining that we intentionally leave the
> semaphore count too high (because otherwise we'd have to block), and
> seccomp_notify_recv() compensates for that?
Will do.
> > + }
> > +
> > + ret = n.val;
> > + err = n.error;
> > +
> > +remove_list:
> > + list_del(&n.list);
> > +out:
> > + mutex_unlock(&match->notify_lock);
> > + syscall_set_return_value(current, task_pt_regs(current),
> > + err, ret);
> > +}
> > +
> > static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > const bool recheck_after_trace)
> > {
> [...]
> > #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > + struct seccomp_filter *);
> > +
> > /**
> > * seccomp_set_mode_filter: internal function for setting seccomp filter
> > * @flags: flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> > struct seccomp_filter *prepared = NULL;
> > long ret = -EINVAL;
> > + int listener = 0;
> > + struct file *listener_f = NULL;
> >
> > /* Validate flags. */
> > if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > if (IS_ERR(prepared))
> > return PTR_ERR(prepared);
> >
> > + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > + listener = get_unused_fd_flags(0);
> > + if (listener < 0) {
> > + ret = listener;
> > + goto out_free;
> > + }
> > +
> > + listener_f = init_listener(current, prepared);
> > + if (IS_ERR(listener_f)) {
> > + put_unused_fd(listener);
> > + ret = PTR_ERR(listener_f);
> > + goto out_free;
> > + }
> > + }
> > +
> > /*
> > * Make sure we cannot change seccomp or nnp state via TSYNC
> > * while another thread is in the middle of calling exec.
> > */
> > if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> > mutex_lock_killable(&current->signal->cred_guard_mutex))
> > - goto out_free;
> > + goto out_put_fd;
> >
> > spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > spin_unlock_irq(&current->sighand->siglock);
> > if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> > mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > + if (ret < 0) {
> > + fput(listener_f);
> > + put_unused_fd(listener);
> > + } else {
> > + fd_install(listener, listener_f);
> > + ret = listener;
> > + }
> > + }
> > out_free:
> > seccomp_filter_free(prepared);
> > return ret;
> [...]
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > + struct seccomp_filter *filter = file->private_data;
> > + struct seccomp_knotif *knotif;
> > +
> > + mutex_lock(&filter->notify_lock);
> > +
> > + /*
> > + * If this file is being closed because e.g. the task who owned it
> > + * died, let's wake everyone up who was waiting on us.
> > + */
> > + list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > + if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > + continue;
> > +
> > + knotif->state = SECCOMP_NOTIFY_REPLIED;
> > + knotif->error = -ENOSYS;
> > + knotif->val = 0;
> > +
> > + complete(&knotif->ready);
> > + }
> > +
> > + wake_up_all(&filter->notif->wqh);
>
> If select() is polling us, a reference to the open file is being held,
> and this can't be reached; and I think if epoll is polling us,
> eventpoll_release() will remove itself from the wait queue, right? So
> can this wake_up_all() actually ever notify anyone?
I don't know actually, I just thought better safe than sorry. I can
drop it, though.
> > + kfree(filter->notif);
> > + filter->notif = NULL;
> > + mutex_unlock(&filter->notify_lock);
> > + __put_seccomp_filter(filter);
> > + return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > + unsigned long arg)
> > +{
> > + struct seccomp_knotif *knotif = NULL, *cur;
> > + struct seccomp_notif unotif = {};
> > + ssize_t ret;
> > + u16 size;
> > + void __user *buf = (void __user *)arg;
> > +
> > + if (copy_from_user(&size, buf, sizeof(size)))
> > + return -EFAULT;
> > +
> > + ret = down_interruptible(&filter->notif->request);
> > + if (ret < 0)
> > + return ret;
> > +
> > + mutex_lock(&filter->notify_lock);
> > + list_for_each_entry(cur, &filter->notif->notifications, list) {
> > + if (cur->state == SECCOMP_NOTIFY_INIT) {
> > + knotif = cur;
> > + break;
> > + }
> > + }
> > +
> > + /*
> > + * If we didn't find a notification, it could be that the task was
> > + * interrupted between the time we were woken and when we were able to
>
> s/interrupted/interrupted by a fatal signal/ ?
>
> > + * acquire the rw lock.
>
> State more explicitly here that we are compensating for an incorrectly
> high semaphore count?
Will do, thanks.
> > + */
> > + if (!knotif) {
> > + ret = -ENOENT;
> > + goto out;
> > + }
> > +
> > + size = min_t(size_t, size, sizeof(unotif));
> > +
> > + unotif.len = size;
> > + unotif.id = knotif->id;
> > + unotif.pid = task_pid_vnr(knotif->task);
> > + unotif.signaled = knotif->signaled;
> > + unotif.data = *(knotif->data);
> > +
> > + if (copy_to_user(buf, &unotif, size)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + ret = size;
> > + knotif->state = SECCOMP_NOTIFY_SENT;
> > + wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > + mutex_unlock(&filter->notify_lock);
> > + return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > + unsigned long arg)
> > +{
> > + struct seccomp_notif_resp resp = {};
> > + struct seccomp_knotif *knotif = NULL;
> > + long ret;
> > + u16 size;
> > + void __user *buf = (void __user *)arg;
> > +
> > + if (copy_from_user(&size, buf, sizeof(size)))
> > + return -EFAULT;
> > + size = min_t(size_t, size, sizeof(resp));
> > + if (copy_from_user(&resp, buf, size))
> > + return -EFAULT;
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
> > +
> > + list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > + if (knotif->id == resp.id)
> > + break;
> > + }
> > +
> > + if (!knotif || knotif->id != resp.id) {
>
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccom_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
>
> Am I missing something?
No, I just flubbed the list API.
> > + ret = -ENOENT;
> > + goto out;
> > + }
> > +
> > + /* Allow exactly one reply. */
> > + if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > + ret = -EINPROGRESS;
> > + goto out;
> > + }
>
> This means that if seccomp_do_user_notification() has in the meantime
> received a signal and transitioned from SENT back to INIT, this will
> fail, right? So we fail here, then we read the new notification, and
> then we can retry SECCOMP_NOTIF_SEND? Is that intended?
I think so, the idea being that you might want to do something
different if a signal was sent. But Andy seemed to think that we might
not actually do anything different.
Either way, for the case you describe, EINPROGRESS is a little weird.
Perhaps it should be:
    if (knotif->state == SECCOMP_NOTIFY_INIT) {
            ret = -EBUSY; /* or something? */
            goto out;
    } else if (knotif->state == SECCOMP_NOTIFY_REPLIED) {
            ret = -EINPROGRESS;
            goto out;
    }
?
> > + ret = size;
> > + knotif->state = SECCOMP_NOTIFY_REPLIED;
> > + knotif->error = resp.error;
> > + knotif->val = resp.val;
> > + complete(&knotif->ready);
> > +out:
> > + mutex_unlock(&filter->notify_lock);
> > + return ret;
> > +}
> > +
> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > + unsigned long arg)
> > +{
> > + struct seccomp_knotif *knotif = NULL;
> > + void __user *buf = (void __user *)arg;
> > + u64 id;
> > + long ret;
> > +
> > + if (copy_from_user(&id, buf, sizeof(id)))
> > + return -EFAULT;
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
> > +
> > + ret = -1;
>
> In strace, this is going to show up as EPERM. Maybe use something like
> -ENOENT instead? Or whatever you think resembles a fitting error
> number.
Yep, will do.
> > + list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > + if (knotif->id == id) {
> > + ret = 0;
>
> Would it make sense to treat notifications that have already been
> replied to as invalid?
I suppose so, since we aren't going to let you reply to them anyway.
> > + goto out;
> > + }
> > + }
> > +
> > +out:
> > + mutex_unlock(&filter->notify_lock);
> > + return ret;
> > +}
> > +
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > + struct poll_table_struct *poll_tab)
> > +{
> > + struct seccomp_filter *filter = file->private_data;
> > + __poll_t ret = 0;
> > + struct seccomp_knotif *cur;
> > +
> > + poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
>
> Looking at the callers of vfs_poll(), as far as I can tell, a poll
> handler is not allowed to return error codes. Perhaps someone who
> knows the poll interface better can weigh in here. I've CCed some
> people who should hopefully know better how this stuff works.
Thanks.
> > + list_for_each_entry(cur, &filter->notif->notifications, list) {
> > + if (cur->state == SECCOMP_NOTIFY_INIT)
> > + ret |= EPOLLIN | EPOLLRDNORM;
> > + if (cur->state == SECCOMP_NOTIFY_SENT)
> > + ret |= EPOLLOUT | EPOLLWRNORM;
> > + if (ret & EPOLLIN && ret & EPOLLOUT)
> > + break;
> > + }
> > +
> > + mutex_unlock(&filter->notify_lock);
> > +
> > + return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > + .poll = seccomp_notify_poll,
> > + .release = seccomp_notify_release,
> > + .unlocked_ioctl = seccomp_notify_ioctl,
> > +};
> > +
> > +static struct file *init_listener(struct task_struct *task,
> > + struct seccomp_filter *filter)
> > +{
>
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
>
> * @filter must only be accessed from the context of current as there
> * is no read locking.
>
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.
I think Kees replied already, but yes, this is a good point :(. We can
continue in his thread.
> > + struct file *ret = ERR_PTR(-EBUSY);
> > + struct seccomp_filter *cur, *last_locked = NULL;
> > + int filter_nesting = 0;
> > +
> > + for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > + mutex_lock_nested(&cur->notify_lock, filter_nesting);
> > + filter_nesting++;
> > + last_locked = cur;
> > + if (cur->notif)
> > + goto out;
> > + }
> > +
> > + ret = ERR_PTR(-ENOMEM);
> > + filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
>
> sizeof(struct notification) instead, to make the code clearer?
>
> > + if (!filter->notif)
> > + goto out;
> > +
> > + sema_init(&filter->notif->request, 0);
> > + INIT_LIST_HEAD(&filter->notif->notifications);
> > + filter->notif->next_id = get_random_u64();
> > + init_waitqueue_head(&filter->notif->wqh);
>
> Nit: next_id and notifications are declared in reverse order in the
> struct. Could you flip them around here?
Sure, will do.
> > + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > + filter, O_RDWR);
> > + if (IS_ERR(ret))
> > + goto out;
> > +
> > +
> > + /* The file has a reference to it now */
> > + __get_seccomp_filter(filter);
>
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.
Will do, thanks.
> > +out:
> > + for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.
I suppose so, given that we have last_locked.
Tycho