Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

From: Andy Lutomirski
Date: Thu May 06 2021 - 19:59:44 EST

On Thu, May 6, 2021 at 12:35 PM Rodrigo Campos <rodrigo@xxxxxxxxxx> wrote:

> We want to be woken-up in seccomp_do_user_notification() so we know we
> received a signal and notify to the supervisor about it. IMHO, we want
> an additional "if" here[2], like (haven't checked the retun codes of
> functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
> | EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
> we only return when the supervisor tells us to, if a signal arrives we
> just let the supervisor know and continue waiting for a response. The
> proper response might be "I wrote partial data, return only 3 bytes",
> etc. This way we can properly emulate it.
> What is not clear to me, and I'd like others to comment is:
> 1. How should this new "the supervisor should handle signals" mode be enabled?
> a. If we use an extra ioctl, as Andy suggested, I think we have a
> race too: from the moment the seccomp syscall is issued until the new
> ioctl is called there is a race. If the container does a syscall in
> that window, it _can_ incorrectly return EINTR as it does now. This
> race can be very small and the ioctl can make _all the next syscalls_
> wait in this new mode, so maybe the race is so small that we don't
> care in practice.

If the ioctl is sticky, it can presumably avoid the race entirely by
having whoever calls seccomp() immediately do a dummy syscall or
otherwise wait for the notifier to confirm that it has done the ioctl.
Admittedly, this may be annoying.

> b. But if we want to really fully solve the issue, the other option
> that comes to mind is adding a new flag for the seccomp syscall, to
> signal that the supervisor will handle signals and should be
> forwarded. This way, if the container does a new syscall it will be
> put to wait in this new mode (we have the information that the new
> mode should be used already). Something like

I like that better.

> 2. What should we notify to the supervisor? Only that a signal arrived
> or also which signal it was? If we do the EPOLLPRI thingy, as Andy
> mentioned in a previous thread, IIUC we will only notify that _a_
> signal arrived, but not _which_. To notify which signal arrived might
> be complex, though, (not sure, I haven't explored how that should be).

Are there any examples in the kernel of syscalls that care *which*
signal came in? As far as I know, there are really only three states
that matter: no signal is pending, a kill is pending, or a regular
signal is pending. (The latter means that there's a signal that is
unmasked, etc and will should therefore interrupt a syscall if the
syscall can be interrupted.