Re: For review: seccomp_user_notif(2) manual page [v2]

From: Sargun Dhillon
Date: Mon Nov 02 2020 - 03:11:57 EST


On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
<mtk.manpages@xxxxxxxxx> wrote:
>
> Hello Sargun,
>
> Thanks for your reply.
>
> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> > On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> > wrote:
>
> [...]
>
> >>> I think I commented in another thread somewhere that the
> >>> supervisor is not notified if the syscall is preempted. Therefore
> >>> if it is performing a preemptible, long-running syscall, you need
> >>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> >>> you can end up in a bad situation -- like leaking resources, or
> >>> holding on to file descriptors after the program under
> >>> supervision has intended to release them.
> >>
> >> It's been a long day, and I'm not sure I reallu understand this.
> >> Could you outline the scnario in more detail?
> >>
> > S: Sets up filter + interception for accept T: socket(AF_INET,
> > SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, 4444}, ..) T: listen(7,
> > 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>
> Presumably, the preceding line should have been:
>
> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> (s/T:/S:/)
>
> right?

Right.
>
>
> > T: accept(7, ...) S: Intercepts accept S: Does accept in background
> > T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> > Still running accept(7, ....), holding port 4444, so if now T
> > retries to bind to port 4444, things fail.
>
> Okay -- I understand. Presumably the solution here is not to
> block in accept(), but rather to use poll() to monitor both the
> notification FD and the listening socket FD?
>
You need to have some kind of mechanism to periodically check
if the notification is still alive, and preempt the accept. It doesn't
matter how exactly you "background" the accept (threads, or
O_NONBLOCK + epoll).

The thing is you need to make sure that when the process
cancels a syscall, you need to release the resources you
may have acquired on its behalf or bad things can happen.

> >>> A very specific example is if you're performing an accept on
> >>> behalf of the program generating the notification, and the
> >>> program intends to reuse the port. You can get into all sorts of
> >>> awkward situations there.
> >>
> >> [...]
> >>
> > See above
>
> [...]
>
> >>> In addition, if it is a socket, it inherits the cgroup v1 classid
> >>> and netprioidx of the receiving process.
> >>>
> >>> The argument of this is as follows:
> >>>
> >>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd;
> >>> __u32 newfd; __u32 newfd_flags; };
> >>>
> >>> id This is the cookie value that was obtained using
> >>> SECCOMP_IOCTL_NOTIF_RECV.
> >>>
> >>> flags A bitmask that includes zero or more of the
> >>> SECCOMP_ADDFD_FLAG_* bits set
> >>>
> >>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
> >>> when copying the file descriptor.
> >>>
> >>> srcfd The file descriptor number to copy in the supervisor
> >>> process.
> >>>
> >>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
> >>> be the file descriptor that is used in the dup2 semantics. If
> >>> this file descriptor exists in the receiving process, it is
> >>> closed and replaced by this file descriptor in an atomic fashion.
> >>> If the copy process fails due to a MAC failure, or if srcfd is
> >>> invalid, the newfd will not be closed in the receiving process.
> >>
> >> Great description!
> >>
> >>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
> >>> 0.
> >>>
> >>> newfd_flags The file descriptor flags to set on the file
> >>> descriptor after it has been received by the process. The only
> >>> flag that can currently be specified is O_CLOEXEC.
> >>>
> >>> On success, this operation returns the file descriptor number in
> >>> the receiving process. On failure, -1 is returned.
> >>>
> >>> It can fail with the following error codes:
> >>>
> >>> EINPROGRESS The cookie number specified hasn't been received by
> >>> the listener
> >>
> >> I don't understand this. Can you say more about the scenario?
> >>
> >
> > This should not really happen. But if you do a ADDFD(...), on a
> > notification *before* you've received it, you will get this error. So
> > for example,
> > --> epoll(....) -> returns
> > --> RECV(...) cookie id is 777
> > --> epoll(...) -> returns
> > <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> > where we've received a notification for 778.
>
> Got it. Looking also at the source code, I came up with the
> following:
>
> EINPROGRESS
> The user-space notification specified in the id
> field exists but has not yet been fetched (by a
> SECCOMP_IOCTL_NOTIF_RECV) or has already been
> responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
>
> Does that seem okay?
>
Looks good to me.

> >>> ENOENT The cookie number is not valid. This can happen if a
> >>> response has already been sent, or if the syscall was
> >>> interrupted
> >>>
> >>> EBADF If the file descriptor specified in srcfd is invalid, or if
> >>> the fd is out of range of the destination program.
> >>
> >> The piece "or if the fd is out of range of the destination program"
> >> is not clear to me. Can you say some more please.
> >>
> >
> > IIRC the maximum fd range is specific in proc by some sysctl named
> > nr_open. It's also evaluated against RLIMITs, and nr_max.
> >
> > If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
> > FDs are open, it wont work if newfd is 1001.
>
> Actually, the relevant limit seems to be just the RLIMIT_NOFILE
> resource limit at least in my reading of fs/file.c::replace_fd().
> So I made the text
>
> EBADF Allocating the file descriptor in the target would
> cause the target's RLIMIT_NOFILE limit to be
> exceeded (see getrlimit(2)).
>
>

If you're above RLIMIT_NOFILE, you get EBADF.

When we do __receive_fd with a specific fd (newfd specified):
https://elixir.bootlin.com/linux/latest/source/fs/file.c#L1086

it calls replace_fd, which calls expand_files. expand_files
can fail with EMFILE.

> >>> EINVAL If flags or new_flags were unrecognized, or if newfd is
> >>> non-zero, and SECCOMP_ADDFD_FLAG_SETFD has not been set.
> >>>
> >>> EMFILE Too many files are open by the destination process.
>
> I'm not sure that the error can really occur. That's the error
> that in most other places occurs when RLIMIT_NOFILE is exceeded.
> But I may have missed something. More precisely, when do you think
> EMFILE can occur?
>
It can happen if the user specifies a newfd which is too large.

> [...]
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/