Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

From: Sargun Dhillon
Date: Fri May 07 2021 - 03:17:46 EST

On Thu, May 6, 2021 at 4:59 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Thu, May 6, 2021 at 12:35 PM Rodrigo Campos <rodrigo@xxxxxxxxxx> wrote:
> > We want to be woken-up in seccomp_do_user_notification() so we know we
> > received a signal and notify to the supervisor about it. IMHO, we want
> > an additional "if" here[2], like (haven't checked the retun codes of
> > functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
> > | EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
> > we only return when the supervisor tells us to, if a signal arrives we
> > just let the supervisor know and continue waiting for a response. The
> > proper response might be "I wrote partial data, return only 3 bytes",
> > etc. This way we can properly emulate it.
> >
> > What is not clear to me, and I'd like others to comment is:
> >
> > 1. How should this new "the supervisor should handle signals" mode be enabled?
> > a. If we use an extra ioctl, as Andy suggested, I think we have a
> > race too: from the moment the seccomp syscall is issued until the new
> > ioctl is called there is a race. If the container does a syscall in
> > that window, it _can_ incorrectly return EINTR as it does now. This
> > race can be very small and the ioctl can make _all the next syscalls_
> > wait in this new mode, so maybe the race is so small that we don't
> > care in practice.
> If the ioctl is sticky, it can presumably avoid the race entirely by
> having whoever calls seccomp() immediately do a dummy syscall or
> otherwise wait for the notifier to confirm that it has done the ioctl.
> Admittedly, this may be annoying.
> > b. But if we want to really fully solve the issue, the other option
> > that comes to mind is adding a new flag for the seccomp syscall, to
> > signal that the supervisor will handle signals and should be
> > forwarded. This way, if the container does a new syscall it will be
> > put to wait in this new mode (we have the information that the new
> > mode should be used already). Something like
> I like that better.
> >
> > 2. What should we notify to the supervisor? Only that a signal arrived
> > or also which signal it was? If we do the EPOLLPRI thingy, as Andy
> > mentioned in a previous thread, IIUC we will only notify that _a_
> > signal arrived, but not _which_. To notify which signal arrived might
> > be complex, though, (not sure, I haven't explored how that should be).
> Are there any examples in the kernel of syscalls that care *which*
> signal came in? As far as I know, there are really only three states
> that matter: no signal is pending, a kill is pending, or a regular
> signal is pending. (The latter means that there's a signal that is
> unmasked, etc and will should therefore interrupt a syscall if the
> syscall can be interrupted.

I do not believe we want full preemption on all syscalls. I think we want
the ability to have "safe" (for some definition of safe) preemption on some

For example, a syscall like connect should allow for arbitrary
preemption, even if the supervisor isn't able to get to it in time (using
alarm here is a thing I've seen as a common pattern). If a connect
starts, and is then preempted, you get an EINTR. glibc will retry the
connect, and if it was already in progress, you should get an
EINPROGRESS, otherwise no error. This is "correct" behaviour,
but not all syscalls do this.

I think we can take a slightly lighterweight approach in in the filter
action, we can include a flag in the data of the action that the filter
return which signifies whether or not the call should be preemptible
before being seen by the supervisor, and another flag which indicates
that once the supervisor has RECV'd the notification, it should not
allow for interruption.

I agree though, that if the call is in the non-preemptible state we need to
have a mechanism to notify the supervisor to cancel work, and
return to the notifying process. I think that if we do the POLLPRI thing,
then there should be a RECV_SIGNAL ioctl to read POLLPRI notifications,
that indicate a particular notification has been preempted (signaled). At this
point, the supervisor can do whatever it wants.

There is also a slight other race condition:
1. Notification is generated wait non-preemptible flag in data
2. Program returns signal
3. At this point, should the kernel do POLLPRI or wait for
the supervisor to first read the notification. Should we add
some metadata to the notification indicating it was attempted
to be preempted?

Would the following work:
1. Add a two flags to the data, indicating pre-notification preemption being
allowed and return -EINTR, and once the notification is picked up by
the supervisor transitioning into an -EINTR state
2. Make it so if an in-progress notification is interrupted via a
signal, it will
send a POLLPRI, which is read via RECV_SIGNAL.