Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

From: Rodrigo Campos
Date: Thu May 06 2021 - 15:36:22 EST

On Sat, May 1, 2021 at 2:10 AM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> On Fri, Apr 30, 2021 at 4:23 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > On Fri, Apr 30, 2021 at 1:49 PM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > >
> > > The user notifier feature allows for filtering of seccomp notifications in
> > > userspace. While the user notifier is handling the syscall, the notifying
> > > process can be preempted, thus ending the notification. This has become a
> > > growing problem, as Golang has adopted signal based async preemption[1]. In
> > > this, it will preempt every 10ms, thus leaving the supervisor less than
> > > 10ms to respond to a given notification. If the syscall require I/O (mount,
> > > connect) on behalf of the process, it can easily take 10ms.
> > >
> > > This allows the supervisor to set a flag that moves the process into a
> > > state where it is only killable by terminating signals as opposed to all
> > > signals. The process can still be terminated before the supervisor receives
> > > the notification.
> >
> > This is still racy, right? If a signal arrives after the syscall
> > enters the seccomp code but before the supervisor gets around to
> > issuing the new ioctl, the syscall will erroneously return -EINTR,
> > right?
> >
> > Can we please just fully fix this instead of piling a racy partial fix
> > on top of an incorrect design?
> >
> > --Andy
> I thought that you were fine with this approach. Sorry.
> Maybe this is a dumb question, what's wrong with returning an EINTR if the
> syscall was never observed by the supervisor?

IIUC It is partially observed by the supervisor.

If you are polling on the seccomp-fd, you _will see_ the POLLIN,
right? That is sent a few lines before the wait, here[1]. And it can
happen that after seeing the POLLIN, by the time we run the recv ioctl
we hang up waiting _for a new syscall to be issued_. Because the
previous syscall could have been interrupted by a signal before we
issued the recv ioctl. IOW, we have spurious wake-ups and no way to
tell when there is something to receive or not for the recv ioctl.
Seems fishy, especially since this is supposed to be a "recv that
blocks signals" ioctl but it can be interrupted by a signal anyways if
they happen before we issue the ioctl.

Another problem is that the container/target will see EINTR on a
syscall that maybe doesn't return EINTR. The program might not handle
that case, as in fact shouldn't ever happen. This is a problem today
too, though, but it is what I understand Andy asks to improve. For
sure it is not proper syscall emulation if the emulated returns things
that should/are never be returned by a regular syscall.

Also, a third problem: if the go runtime gives like ~10ms in some
cases between signals, it is not clear how many times we will issue
the recv ioctl before the runtime sends a signal and interrupts the
container waiting. Furthermore the race is still there, just the
window is smaller.

The main point being the second issue I mentioned: it will never be
correct emulation if an emulated syscall returns different things than
a real one (and things that can't be returned at all by some
syscalls). Additionally, the emulated syscall behaviour should also be
compatible with the real, etc.

> I think that the only other reasonable design is that we add data to the existing action which makes it sleep in wait_killable state.

I don't think the wait should be doing a wait_killable under Andy's idea.

We want to be woken-up in seccomp_do_user_notification() so we know we
received a signal and notify to the supervisor about it. IMHO, we want
an additional "if" here[2], like (haven't checked the retun codes of
functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
| EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
we only return when the supervisor tells us to, if a signal arrives we
just let the supervisor know and continue waiting for a response. The
proper response might be "I wrote partial data, return only 3 bytes",
etc. This way we can properly emulate it.

What is not clear to me, and I'd like others to comment is:

1. How should this new "the supervisor should handle signals" mode be enabled?
a. If we use an extra ioctl, as Andy suggested, I think we have a
race too: from the moment the seccomp syscall is issued until the new
ioctl is called there is a race. If the container does a syscall in
that window, it _can_ incorrectly return EINTR as it does now. This
race can be very small and the ioctl can make _all the next syscalls_
wait in this new mode, so maybe the race is so small that we don't
care in practice.
b. But if we want to really fully solve the issue, the other option
that comes to mind is adding a new flag for the seccomp syscall, to
signal that the supervisor will handle signals and should be
forwarded. This way, if the container does a new syscall it will be
put to wait in this new mode (we have the information that the new
mode should be used already). Something like

2. What should we notify to the supervisor? Only that a signal arrived
or also which signal it was? If we do the EPOLLPRI thingy, as Andy
mentioned in a previous thread, IIUC we will only notify that _a_
signal arrived, but not _which_. To notify which signal arrived might
be complex, though, (not sure, I haven't explored how that should be).

3. How should we notify whatever we want to notify? If we only notify
that a signal arrived, that is probably simple. But if we want to
notify which signal arrived it might be tricker and would like to know
what others think about how that interface should be (adjust the
struct seccomp_notif I guess?), if there are strong opinions, etc.

I'll be away for a few weeks now, but I'll catch up with the discussion ASAP.