Re: For review: seccomp_user_notif(2) manual page [v2]

From: Sargun Dhillon
Date: Fri Oct 30 2020 - 16:31:08 EST


On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Sargun,
>
> On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
> >> ioctl(2) operations
> >> The following ioctl(2) operations are provided to support seccomp
> >> user-space notification. For each of these operations, the first
> >> (file descriptor) argument of ioctl(2) is the listening file
> >> descriptor returned by a call to seccomp(2) with the
> >> SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> >>
> >> SECCOMP_IOCTL_NOTIF_RECV
> >> This operation is used to obtain a user-space notification
> >> event. If no such event is currently pending, the
> >> operation blocks until an event occurs. The third
> >> ioctl(2) argument is a pointer to a structure of the
> >> following form which contains information about the event.
> >> This structure must be zeroed out before the call.
> >>
> >> struct seccomp_notif {
> >> __u64 id; /* Cookie */
> >> __u32 pid; /* TID of target thread */
> >> __u32 flags; /* Currently unused (0) */
> >> struct seccomp_data data; /* See seccomp(2) */
> >> };
> >>
> >> The fields in this structure are as follows:
> >>
> >> id This is a cookie for the notification. Each such
> >> cookie is guaranteed to be unique for the
> >> corresponding seccomp filter.
> >>
> >> · It can be used with the
> >> SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
> >> to verify that the target is still alive.
> >>
> >> · When returning a notification response to the
> >> kernel, the supervisor must include the cookie
> >> value in the seccomp_notif_resp structure that is
> >> specified as the argument of the
> >> SECCOMP_IOCTL_NOTIF_SEND operation.
> >>
> >> pid This is the thread ID of the target thread that
> >> triggered the notification event.
> >>
> >> flags This is a bit mask of flags providing further
> >> information on the event. In the current
> >> implementation, this field is always zero.
> >>
> >> data This is a seccomp_data structure containing
> >> information about the system call that triggered
> >> the notification. This is the same structure that
> >> is passed to the seccomp filter. See seccomp(2)
> >> for details of this structure.
> >>
> >> On success, this operation returns 0; on failure, -1 is
> >> returned, and errno is set to indicate the cause of the
> >> error. This operation can fail with the following errors:
> >>
> >> EINVAL (since Linux 5.5)
> >> The seccomp_notif structure that was passed to the
> >> call contained nonzero fields.
> >>
> >> ENOENT The target thread was killed by a signal as the
> >> notification information was being generated, or
> >> the target's (blocked) system call was interrupted
> >> by a signal handler.
> >>
> >
> > I think I commented in another thread somewhere that the supervisor is not
> > notified if the syscall is preempted. Therefore if it is performing a
> > preemptible, long-running syscall, you need to poll
> > SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise you can
> > end up in a bad situation -- like leaking resources, or holding on to
> > file descriptors after the program under supervision has intended to
> > release them.
>
> It's been a long day, and I'm not sure I really understand this.
> Could you outline the scenario in more detail?
>
S: Sets up filter + interception for accept
T: socket(AF_INET, SOCK_STREAM, 0) = 7
T: bind(7, {127.0.0.1, 4444}, ..)
T: listen(7, 10)
S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
T: accept(7, ...)
S: Intercepts accept
S: Does accept in background
T: Receives signal, and accept(...) fails with EINTR
T: close(7)
S: Still running accept(7, ....), holding port 4444, so if T now retries
to bind to port 4444, things fail.
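
To make the workaround concrete, here is a rough sketch of how the supervisor
could do the accept without blocking indefinitely, re-checking the cookie with
SECCOMP_IOCTL_NOTIF_ID_VALID between waits. The helper names (notif_id_valid,
supervised_accept) and the poll_ms parameter are made up for illustration, and
this is only one way to structure it, not a tested implementation.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: returns 1 if the notification cookie is still valid
 * (the target is still blocked in the intercepted syscall), 0 otherwise,
 * including when the ioctl itself fails.
 */
int notif_id_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/*
 * Hypothetical helper: accept on listen_fd on behalf of the target, waking up
 * every poll_ms milliseconds to check that the notification is still live.
 * Returns the accepted fd, or -1 with errno set. errno == ENOENT means the
 * target's accept() went away (for example, it was interrupted by a signal),
 * so the caller should close its copy of listen_fd.
 */
int supervised_accept(int notify_fd, __u64 id, int listen_fd, int poll_ms)
{
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

    for (;;) {
        if (!notif_id_valid(notify_fd, id)) {
            errno = ENOENT;
            return -1;
        }

        int n = poll(&pfd, 1, poll_ms);
        if (n > 0)
            return accept(listen_fd, NULL, NULL);
        if (n < 0 && errno != EINTR)
            return -1;
        /* Timeout or EINTR: loop around and re-check the cookie. */
    }
}
```

There is still a window between the validity check and the accept, so this
narrows the problem rather than eliminating it.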

> > A very specific example is if you're performing an accept on behalf
> > of the program generating the notification, and the program intends
> > to reuse the port. You can get into all sorts of awkward situations
> > there.
>
> [...]
>
See above

> > SECCOMP_IOCTL_NOTIF_ADDFD (Since Linux v5.9)
> > This operation is used by the supervisor to add a file
> > descriptor to the process that generated the notification.
> > This can be used by the supervisor to enable "emulation"
> > [Probably a better word] of syscalls which return file
> > descriptors, such as socket(2), or open(2).
> >
> > When the file descriptor is received by the process that
> > is associated with the notification / cookie, it follows
> > SCM_RIGHTS like semantics, and is evaluated by MAC.
>
> I'm not sure what you mean by SCM_RIGHTS like semantics. Do you mean,
> the file descriptor refers to the same open file description
> ('struct file')?
>
> "is evaluated by MAC"... Do you mean something like: the FD is
> subject to LSM checks?
>
It's the same model as SCM_RIGHTS: the file is checked against LSMs in the same
way, and if your LSM hooks in, it'll hit the same hook that moving the file via
SCM_RIGHTS would trigger. Also, as with SCM_RIGHTS, some aspects of the fd end
up shared and others differ (like the flags). Perhaps there's a better term to
describe these semantics.

RE: Evaluated by MAC - yes, checked by LSMs.

> > In addition, if it is a socket, it inherits the cgroup
> > v1 classid and netprioidx of the receiving process.
> >
> > The argument of this is as follows:
> >
> > struct seccomp_notif_addfd {
> > __u64 id;
> > __u32 flags;
> > __u32 srcfd;
> > __u32 newfd;
> > __u32 newfd_flags;
> > };
> >
> > id
> > This is the cookie value that was obtained using
> > SECCOMP_IOCTL_NOTIF_RECV.
> >
> > flags
> > A bitmask that includes zero or more of the
> > SECCOMP_ADDFD_FLAG_* bits set
> >
> > SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?)
> > like semantics when copying the file
> > descriptor.
> >
> > srcfd
> > The file descriptor number to copy in the
> > supervisor process.
> >
> > newfd
> > If the SECCOMP_ADDFD_FLAG_SETFD flag is specified
> > this will be the file descriptor that is used
> > in the dup2 semantics. If this file descriptor
> > exists in the receiving process, it is closed
> > and replaced by this file descriptor in an
> > atomic fashion. If the copy process fails
> > due to a MAC failure, or if srcfd is invalid,
> > the newfd will not be closed in the receiving
> > process.
>
> Great description!
>
> > If SECCOMP_ADDFD_FLAG_SETFD is not set, then
> > this value must be 0.
> >
> > newfd_flags
> > The file descriptor flags to set on
> > the file descriptor after it has been received
> > by the process. The only flag that can currently
> > be specified is O_CLOEXEC.
> >
> > On success, this operation returns the file descriptor
> > number in the receiving process. On failure, -1 is returned.
> >
> > It can fail with the following error codes:
> >
> > EINPROGRESS
> > The cookie number specified hasn't been received
> > by the listener
>
> I don't understand this. Can you say more about the scenario?
>

This should not really happen. But if you do an ADDFD(...) on a notification
*before* you've received it, you will get this error. So, for example:
--> epoll(....) -> returns
--> RECV(...) cookie id is 777
--> epoll(...) -> returns
<-- ioctl(ADDFD, id = 778) # Notice that we haven't yet done the RECV
that would have returned the notification with cookie 778.

> > ENOENT
> > The cookie number is not valid. This can happen
> > if a response has already been sent, or if the
> > syscall was interrupted
> >
> > EBADF
> > If the file descriptor specified in srcfd is
> > invalid, or if the fd is out of range of the
> > destination program.
>
> The piece "or if the fd is out of range of the destination
> program" is not clear to me. Can you say some more please.
>

IIRC the maximum fd range is specified in /proc by a sysctl named
nr_open. It's also evaluated against RLIMITs, and nr_max.

If nr_open (maximum fds open per process, IIRC) is 1000, even
if only 10 FDs are open, it won't work if newfd is 1001.

> > EINVAL
> > If flags or newfd_flags were unrecognized, or
> > if newfd is non-zero, and SECCOMP_ADDFD_FLAG_SETFD
> > has not been set.
> >
> > EMFILE
> > Too many files are open by the destination process.
> >
> > [there are other error codes possible, like from the LSMs,
> > or if memory can't be read / written, or EBUSY]
> >
> > Does this help?
>
> It's a good start!
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/