Re: [PATCH 2/2] seccomp.2: document userspace notification

From: Michael Kerrisk (man-pages)
Date: Thu Feb 28 2019 - 07:52:27 EST

Next message: Wei Liu: "Re: [PATCH v2] xen-netback: fix occasional leak of grant ref mappings under memory pressure"
Previous message: Chris Down: "Re: [PATCH] mm, memcg: Handle cgroup_disable=memory when getting memcg protection"
Next in thread: Michael Kerrisk (man-pages): "Re: [PATCH 2/2] seccomp.2: document userspace notification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

[widening CC to the same scope as the code patches]

Hello Tycho,

Sorry for the late response on this. Could you amend/rebase your patches
in light of the below please.

And anyone's comment on my "big picture" text at the foot of this
mail would be very welcome.

On 12/13/18 1:11 AM, Tycho Andersen wrote:
> Here's some documentation on how to use the userspace notification. I tried
> to mention the biggest TOCTOU gotcha, but the pointer to the sample is
> really the best thing,

Cough! No, the "best thing" really is a good sample program *and* really
good documentation :-).

> as it has actual code and many comments about what's
> going on.
>
> Signed-off-by: Tycho Andersen <tycho@xxxxxxxx>
> CC: Kees Cook <keescook@xxxxxxxxxxxx>
> ---
> man2/seccomp.2 | 85 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index d69187783..fef8914bf 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -228,6 +228,13 @@ An administrator may override this filter flag by preventing specific
> actions from being logged via the
> .IR /proc/sys/kernel/seccomp/actions_logged
> file.
> +.TP
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER " (since Linux 4.21)"
> +With this flag, a new userspace notification fd is returned on success. When
> +this filter returns
> +.BR SECCOMP_RET_USER_NOTIF
> +a notification will be sent to this fd. See "Userspace Notification" below for

s/fd/file descriptor/ throughout please.

> +more details.

I think the description here could be better worded as something like:

SECCOMP_FILTER_FLAG_NEW_LISTENER
Register a new filter, as usual, but on success return a
new file descriptor that provides user-space notifications.
When the filter returns SECCOMP_RET_USER_NOTIF, a notification
will be provided via this file descriptor. The close-on-exec
flag is automatically set on the new file descriptor. ...

> .RE
> .TP
> .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> @@ -606,6 +613,17 @@ file.
> .TP
> .BR SECCOMP_RET_ALLOW
> This value results in the system call being executed.
> +.TP
> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"

Please see the start of this hanging list in the manual page.
Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
in the precedence order of all of the filter return values?

> +Forwards the syscall to an attached listener in userspace to allow userspace to

s/syscall/system call throughout please.

> +decide what to do with the syscall. If there is no attached listener (either
> +because the filter was not installed with the
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +or because the fd was closed), the filter returns
> +.BR ENOSYS
> +similar to what happens when a filter returns
> +.BR SECCOMP_RET_TRACE
> +and there is no tracer. See "Userspace Notification" below for more details.
> .PP
> If an action value other than one of the above is specified,
> then the filter action is treated as either
> @@ -693,10 +711,75 @@ Otherwise, if kernel auditing is enabled and the process is being audited
> the action is logged.
> .IP *
> Otherwise, the action is not logged.
> +.SS Userspace Notification
> +Interactin userspace notification functionality in seccomp is primarily done
> +via file descriptor.

That sentence is somewhat garbled. Even if I correct the typo,
I still don't really understand it. Could you try again?
(Or, take a look at my "big picture" text at the foot of this mail.)

> This file descriptor can be obtained by passing
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +as a filter flag when installing a new filter.
> +.PP
> +Once an fd is obtained, userspace can wait for events using
> +.BR poll ()

and presumably select() and epoll?

What kind of notification event do poll(2)/epoll(7)/select(2) provide?
It looks to be POLLIN/EPOLLIN/readable. Is that correct?
These details should be noted here. More generally, my assumption
is that you can use poll(2)/epoll(7)/select(2) to find out about the
availability of an event, and then use SECCOMP_IOCTL_NOTIF_RECV
to read that event. Correct? The text needs to be more explicit on
this.

> +or
> +.BR ioctl ().
> +The supported
> +.BR ioctl ()
> +operations on a notification fd are:
> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_RECV
> +The argument to this command should be a pointer to a struct seccomp_notif:
> +.IP
> +.in +4n
> +.EX
> +struct seccomp_notif {
> + __u64 id;
> + __u32 pid;
> + __u32 flags;
> + struct seccomp_data data;
> +};
> +.EE
> +.in
> +.IP
> +The id field is a filter-unique id for this syscall, and should be supplied in
> +the response. It can additionally be used in
> +.BR SECCOMP_IOCTL_ID_VALID
> +to test whether or not the request is still alive. The pid here is the pid of
> +the task as visible from the listener's pid namespace. If the pid is not
> +visible, it is 0. Flags is unused right now.

So, is 'flags' explicitly zeroed by the kernel? the manual page should note
this.

> struct seccomp_data is the same
> +data that would be passed to a filter running in the kernel.

What are the semantics if multiple monitoring processes are employing
SECCOMP_IOCTL_NOTIF_RECV? Does only one of them get awoken? (Which one?)
Or do they all get woken up and get a 'struct seccomp_notif'? The semantics
should be detailed here.

> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_SEND
> +The argument to this command should be a pointer to a struct seccomp_notif_resp:
> +.IP
> +.in +4n
> +.EX
> +struct seccomp_notif_resp {
> + __u64 id;
> + __s64 val;
> + __s32 error;
> + __u32 flags;
> +};
> +.EE
> +.in
> +.IP
> +The id should be the id from struct seccomp_notif; if error is non-zero, it is
> +used as the return value, otherwise val is. Flags must be 0 right now.

There really isn't enough detail here to help the reader understand. What
is the purpose of 'error' vs 'val'? In particular, I'm assuming that the
monitoring process has (at least) two choices with respect to the system
call being made by the target process:

(1) Cause the system call to fail with an error.
(2) Allow the system call to proceed.

How are 'error' and 'val' used in these two cases?
Are there more than these two cases? How else is 'val' used?

Also, what happens if the monitoring process does a SECCOMP_IOCTL_NOTIF_RECV,
but does not subsequently do a SECCOMP_IOCTL_NOTIF_SEND? This should be
detailed in the manual page.

What are the semantics if multiple monitoring processes are employing
SECCOMP_IOCTL_NOTIF_SEND? An error for all but the first? The semantics
should be detailed here.

> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_ID_VALID
> +The argument to this command is just the id to be tested. There is a pid reuse
> +issue where a task may make a syscall, die, its pid get re-used, and the
> +listener may operate on the wrong pid. So the process for handling modifying a
> +pid's state in some way would be to open the pid's resources (/proc/pid/mem or
> +similar), do this call to verify that the id is still valid, and then use the
> +resource. See the sample below for more detail.
> +.PP
> +A complete example is available in the kernel tree at
> +.IR samples/seccomp/user-trap.c .
> .SH RETURN VALUE
> On success,
> .BR seccomp ()
> -returns 0.
> +returns 0, unless
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +was specified, in which case it returns the fd number for the new listener fd.
> On error, if
> .BR SECCOMP_FILTER_FLAG_TSYNC
> was used,

More generally though, I'm struggling with this patch because "the
big picture" is lacking. Here's my estimate of what I think that
picture is:

[[
Instead of having the seccomp() filter decision being made
within the filter itself, it is possible to pass off the
decision making to a (different) user-space process that
makes the decision. This is done using the following steps.

1. The "target process" (i.e., process that will
establish a seccomp filter) and the process that will make
decisions about system calls ("the monitoring process")
establish a connection using a UNIX domain socket. This
socket will subsequently be used to exchange a file
descriptor.

2. The "target process" establishes a seccomp BPF filter in the
usual manner, but with two notable differences:

* The seccomp() 'flags' argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
value of a successful seccomp() call is a new "listening"
file descriptor that can be used for monitoring.
* In cases where it is needed, the BPF filter returns the
special action SECCOMP_RET_USER_NOTIF. This return value
will trigger a notification event.

3. The "target process" passes the "listening file descriptor"
to the "monitoring process" via the UNIX domain socket.

4. The target process then performs its workload, which includes
system calls that will be controlled by the BPF filter.
Whenever one of these system calls causes the BPF filter to
return SECCOMP_RET_USER_NOTIF, a notification event is
generated on the listening file descriptor.

5. The monitoring process will receive notification events
on the listening file descriptor. These events are returned
as structures of type 'seccomp_notif'. Because this structure
and its size may evolve over kernel version, the monitoring
process must first determine the size of this structure using
the seccomp() SECCOMP_GET_NOTIF_SIZES operation, which
returns a structure of type 'seccomp_notif_sizes'. The
monitoring process allocates a buffer of size
'seccomp_notif_sizes.seccomp_notif' bytes to receive
monitoring events. In addition,the monitoring process
allocates another buffer of size
'seccomp_notif_sizes.seccomp_notif_resp' bytes for the
response (a 'struct seccomp_notif_resp') that it will
provide to the kernel in order to advise how the system
call being made by the target process shall be treated.

6. The monitoring process can now repeatedly monitor the
listening file descriptor.

When a SECCOMP_RET_USER_NOTIF-triggered event occurs,
the file descriptor will test as readable for
poll(2)/epoll(7)/select(2). The call

ioctl(listenfd, SECCOMP_IOCTL_NOTIF_RECV, reqptr)

can be used to read info on the event; this operation
blocks until an event is available. It populates
the 'struct seccomp_notif' pointed to by the third
argument with information about the system call
that is being attempted by the target process.

7. The monitoring process can use the information in the
'struct seccomp_notif' to make a determination about the
system call being made by the target process. This
structure includes a 'data' field that is the same
'struct seccomp_data' that is passed to a BPF filter.

In addition, the monitoring process may make use of other
information that is available from user space. For example,
it may inspect the memory of the target process (whose PID
is provided in the 'struct seccomp_notif') using
/proc/PID/mem, which includes inspecting the values
pointed to by system call arguments (whose location is
available 'seccomp_notif.data.args). However, when using
the target process PID in this way, one must guard against
PID re-use race conditions using the seccomp()
SECCOMP_IOCTL_NOTIF_ID_VALID operation.

8. Having arrived at a decision about the target process's
system call, the monitoring process can inform the kernel
of its decision using the operation

ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)

where the third argument is a pointer to a
'struct seccomp_notif_resp'. [Some more details
needed here, but I still don't yet understand fully
the semantics of the 'error' and 'val' fields.]

I'd like to include text along the above lines at the start of
the "Userspace Notification" section. Could you take a look at
the above and let me know any errors, improved terminology, or
other improvements generally. I'll then work that text into
the page.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Next message: Wei Liu: "Re: [PATCH v2] xen-netback: fix occasional leak of grant ref mappings under memory pressure"
Previous message: Chris Down: "Re: [PATCH] mm, memcg: Handle cgroup_disable=memory when getting memcg protection"
Next in thread: Michael Kerrisk (man-pages): "Re: [PATCH 2/2] seccomp.2: document userspace notification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]