Re: For review: seccomp_user_notif(2) manual page

From: Michael Kerrisk (man-pages)
Date: Mon Oct 26 2020 - 05:44:51 EST


Hello Kees,

On 10/26/20 1:19 AM, Kees Cook wrote:
> On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 10/1/20 1:39 AM, Kees Cook wrote:
>>> I'll comment more later, but I've run out of time today and I didn't see
>>> anyone mention this detail yet in the existing threads... :)
>>
>> Later never came :-). But, I hope you may have comments for the
>> next draft, which I will send out soon.
>
> Later is now, and Soon approaches!
>
> I finally caught up and read through this whole thread. Thank you all
> for the bug fix[1], and I'm looking forward to more[2]. :)


> For my reply I figured I'd base it on the current draft, so here's a
> simulated quote based on the seccomp_user_notif branch of
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> through commit 71101158fe330af5a26552447a0bb433b69e15b7
> $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

Thanks for reviewing the latest version!

> On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> SECCOMP_USER_NOTIF(2) Linux Programmer's Manual SECCOMP_USER_NOTIF(2)
>>
>> NAME
>> seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>> #include <linux/seccomp.h>
>> #include <linux/filter.h>
>> #include <linux/audit.h>
>>
>> int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> #include <sys/ioctl.h>
>>
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>> struct seccomp_notif *req);
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>> struct seccomp_notif_resp *resp);
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>>
>> DESCRIPTION
>> This page describes the user-space notification mechanism provided
>> by the Secure Computing (seccomp) facility. As well as the use of
>> the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>> SECCOMP_RET_USER_NOTIF action value, and the
>> SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>> mechanism involves the use of a number of related ioctl(2)
>> operations (described below).
>>
>> Overview
>> In conventional usage of a seccomp filter, the decision about how
>> to treat a system call is made by the filter itself. By contrast,
>> the user-space notification mechanism allows the seccomp filter to
>> delegate the handling of the system call to another user-space
>> process. Note that this mechanism is explicitly not intended as a
>> method implementing security policy; see NOTES.
>>
>> In the discussion that follows, the thread(s) on which the seccomp
>> filter is installed is (are) referred to as the target, and the
>> process that is notified by the user-space notification mechanism
>> is referred to as the supervisor.
>>
>> A suitably privileged supervisor can use the user-space
>> notification mechanism to perform actions on behalf of the target.
>> The advantage of the user-space notification mechanism is that the
>> supervisor will usually be able to retrieve information about the
>> target and the performed system call that the seccomp filter
>> itself cannot. (A seccomp filter is limited in the information it
>> can obtain and the actions that it can perform because it is
>> running on a virtual machine inside the kernel.)
>>
>> An overview of the steps performed by the target and the
>> supervisor is as follows:
>>
>> 1. The target establishes a seccomp filter in the usual manner,
>> but with two differences:
>>
>> • The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
>> value of the (successful) seccomp(2) call is a new
>
> nit: extra space

Thanks. Fixed.

>> "listening" file descriptor that can be used to receive
>> notifications. Only one "listening" seccomp filter can be
>> installed for a thread.
>
> I like this limitation, but I expect that it'll need to change in the
> future. Even with LSMs, we see the need for arbitrary stacking, and the
> idea of there being only 1 supervisor will eventually break down. Right
> now there is only 1 because only container managers are using this
> feature. But if some daemon starts using it to isolate some thread,
> suddenly it might break if a container manager is trying to listen to it
> too, etc. I expect it won't be needed soon, but I do think it'll change.

Thanks for the background. (I added your text in a comment in the
page, just for my own reference in the future.)

>> • In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF. This return value
>> will trigger a notification event.
>>
>> 2. In order that the supervisor can obtain notifications using the
>> listening file descriptor, (a duplicate of) that file
>> descriptor must be passed from the target to the supervisor.
>
> Yet another reason to have an "activate on exec" mode for seccomp. With

Funnily enough, I was having an in-person conversation just last week
with someone else who was interested in "activate-on-exec".

> no_new_privs _not_ being delayed in such a way, I think it'd be safe to
> add. The supervisor would get the fd immediately, and then once it
> fork/execed suddenly the whole thing would activate, and no fd passing
> needed.
>
> The "on exec" boundary is really only needed for oblivious targets. For
> a coordinated target, I've thought it might be nice to have an arbitrary
> "go" point, where the target could call seccomp() with something like a
> SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process
> initialization happen that might need to do things that would be blocked
> by filters, etc.
>
> Before:
>
> fork
> install some filters that don't block initialization
> exec
> do some initialization
> install more filters, maybe block exec, seccomp
> run
>
> After:
>
> fork
> install delayed filters
> exec
> do some initialization
> activate delayed filters
> run
>
> In practice, the two-stage filter application has been fine, if
> sometimes a bit complex (e.g. for user_notif, "do some initialization"
> includes figuring out how to pass the fd back to the supervisor, etc).

Yes, something like what you describe above would certainly make some
uses easier. Activate-on-exec seems to me the most compelling need
though..

>> One way in which this could be done is by passing the file
>> descriptor over a UNIX domain socket connection between the
>> target and the supervisor (using the SCM_RIGHTS ancillary
>> message type described in unix(7)).
>>
>> 3. The supervisor will receive notification events on the
>> listening file descriptor. These events are returned as
>> structures of type seccomp_notif. Because this structure and
>> its size may evolve over kernel versions, the supervisor must
>> first determine the size of this structure using the seccomp(2)
>> SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of
>> type seccomp_notif_sizes. The supervisor allocates a buffer of
>> size seccomp_notif_sizes.seccomp_notif bytes to receive
>> notification events. In addition,the supervisor allocates
>> another buffer of size seccomp_notif_sizes.seccomp_notif_resp
>> bytes for the response (a struct seccomp_notif_resp structure)
>> that it will provide to the kernel (and thus the target).
>>
>> 4. The target then performs its workload, which includes system
>> calls that will be controlled by the seccomp filter. Whenever
>> one of these system calls causes the filter to return the
>> SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
>> execute the system call; instead, execution of the target is
>> temporarily blocked inside the kernel (in a sleep state that is
>> interruptible by signals) and a notification event is generated
>> on the listening file descriptor.
>>
>> 5. The supervisor can now repeatedly monitor the listening file
>> descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do
>> this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
>> operation to read information about a notification event; this
>> operation blocks until an event is available. The operation
>> returns a seccomp_notif structure containing information about
>> the system call that is being attempted by the target.
>>
>> 6. The seccomp_notif structure returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation includes the same
>> information (a seccomp_data structure) that was passed to the
>> seccomp filter. This information allows the supervisor to
>> discover the system call number and the arguments for the
>> target's system call. In addition, the notification event
>> contains the ID of the thread that triggered the notification.
>
> Should "cookie" be at least named here, just to provide a bit more
> context for when it is mentioned in 8 below? E.g.:
>
> ... In addition, the notification event
> contains the triggering thread's ID and a unique cookie to be
> used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and
> SECCOMP_IOCTL_NOTIF_SEND operations.

Good catch! Changed as you suggest. (And thanks so much for all
your suggested rewordings; that makes things *much* easier for me.)

>> The information in the notification can be used to discover the
>> values of pointer arguments for the target's system call.
>> (This is something that can't be done from within a seccomp
>> filter.) One way in which the supervisor can do this is to
>> open the corresponding /proc/[tid]/mem file (see proc(5)) and
>> read bytes from the location that corresponds to one of the
>> pointer arguments whose value is supplied in the notification
>> event. (The supervisor must be careful to avoid a race
>> condition that can occur when doing this; see the description
>> of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
>> In addition, the supervisor can access other system information
>> that is visible in user space but which is not accessible from
>> a seccomp filter.
>>
>> 7. Having obtained information as per the previous step, the
>> supervisor may then choose to perform an action in response to
>> the target's system call (which, as noted above, is not
>> executed when the seccomp filter returns the
>> SECCOMP_RET_USER_NOTIF action value).
>>
>> One example use case here relates to containers. The target
>> may be located inside a container where it does not have
>> sufficient capabilities to mount a filesystem in the
>> container's mount namespace. However, the supervisor may be a
>> more privileged process that does have sufficient capabilities
>> to perform the mount operation.
>>
>> 8. The supervisor then sends a response to the notification. The
>> information in this response is used by the kernel to construct
>> a return value for the target's system call and provide a value
>> that will be assigned to the errno variable of the target.
>>
>> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>> ioctl(2) operation, which is used to transmit a
>> seccomp_notif_resp structure to the kernel. This structure
>> includes a cookie value that the supervisor obtained in the
>> seccomp_notif structure returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows
>> the kernel to associate the response with the target.
>
> Describing where the cookie came from seems like it should live in 6
> above. A reader would have to take this new info and figure out where
> SECCOMP_IOCTL_NOTIF_RECV was described and piece it together.

Yeah. I hate it when the documentation loses the reader like that :-}.

> With the
> suggestion to 6 above, maybe:
>
> ... This structure
> must include the cookie value that the supervisor obtained in
> the seccomp_notif structure returned by the
> SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel
> to associate the response with the target.

Great! Changed.

>> 9. Once the notification has been sent, the system call in the
>> target thread unblocks, returning the information that was
>> provided by the supervisor in the notification response.
>>
>> As a variation on the last two steps, the supervisor can send a
>> response that tells the kernel that it should execute the target
>> thread's system call; see the discussion of
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>> ioctl(2) operations
>> The following ioctl(2) operations are provided to support seccomp
>> user-space notification. For each of these operations, the first
>> (file descriptor) argument of ioctl(2) is the listening file
>> descriptor returned by a call to seccomp(2) with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>> SECCOMP_IOCTL_NOTIF_RECV
>> This operation is used to obtain a user-space notification
>> event. If no such event is currently pending, the
>> operation blocks until an event occurs. The third ioctl(2)
>> argument is a pointer to a structure of the following form
>> which contains information about the event. This structure
>> must be zeroed out before the call.
>>
>> struct seccomp_notif {
>> __u64 id; /* Cookie */
>> __u32 pid; /* TID of target thread */
>
> Should we rename this variable from pid to tid? Yes it's UAPI, but yay for
> anonymous unions:
>
> struct seccomp_notif {
> __u64 id; /* Cookie */
> union {
> __u32 pid;
> __u32 tid; /* TID of target thread */
> };
> __u32 flags; /* Currently unused (0) */
> struct seccomp_data data; /* See seccomp(2) */
> };

Yes, it would be nice to make this change. But, already there
are so many places in the UAPI where the pid/tid is messed upp :-(.

>> __u32 flags; /* Currently unused (0) */
>> struct seccomp_data data; /* See seccomp(2) */
>> };
>>
>> The fields in this structure are as follows:
>>
>> id This is a cookie for the notification. Each such
>> cookie is guaranteed to be unique for the
>> corresponding seccomp filter.
>>
>> • It can be used with the
>> SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>> verify that the target is still alive.
>>
>> • When returning a notification response to the
>> kernel, the supervisor must include the cookie
>> value in the seccomp_notif_resp structure that is
>> specified as the argument of the
>> SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>> pid This is the thread ID of the target thread that
>> triggered the notification event.
>>
>> flags This is a bit mask of flags providing further
>> information on the event. In the current
>> implementation, this field is always zero.
>>
>> data This is a seccomp_data structure containing
>> information about the system call that triggered the
>> notification. This is the same structure that is
>> passed to the seccomp filter. See seccomp(2) for
>> details of this structure.
>>
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINVAL (since Linux 5.5)
>> The seccomp_notif structure that was passed to the
>> call contained nonzero fields.
>>
>> ENOENT The target thread was killed by a signal as the
>> notification information was being generated, or the
>> target's (blocked) system call was interrupted by a
>> signal handler.
>>
>> SECCOMP_IOCTL_NOTIF_ID_VALID
>> This operation can be used to check that a notification ID
>> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>> is still valid (i.e., that the target still exists).
>
> Maybe clarify a bit more, since it's covering more than just "is the
> target still alive", but also "is that syscall still waiting for a
> response":
>
> is still valid (i.e., that the target still exists and
> the syscall is still blocked waiting for a response).

Thanks. I made it:

(i.e., that the target still exists and its system call
is still blocked waiting for a response).

>> The third ioctl(2) argument is a pointer to the cookie (id)
>> returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>> This operation is necessary to avoid race conditions that
>> can occur when the pid returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
>> process ID is reused by another process. An example of
>> this kind of race is the following
>>
>> 1. A notification is generated on the listening file
>> descriptor. The returned seccomp_notif contains the TID
>> of the target thread (in the pid field of the
>> structure).
>>
>> 2. The target terminates.
>>
>> 3. Another thread or process is created on the system that
>> by chance reuses the TID that was freed when the target
>> terminated.
>>
>> 4. The supervisor open(2)s the /proc/[tid]/mem file for the
>> TID obtained in step 1, with the intention of (say)
>> inspecting the memory location(s) that containing the
>> argument(s) of the system call that triggered the
>> notification in step 1.
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the
>> target. This race can be avoided by following the call to
>> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>> verify that the process that generated the notification is
>> still alive. (Note that if the target terminates after the
>> latter step, a subsequent read(2) from the file descriptor
>> may return 0, indicating end of file.)
>>
>> On success (i.e., the notification ID is still valid), this
>> operation returns 0. On failure (i.e., the notification ID
>> is no longer valid), -1 is returned, and errno is set to
>> ENOENT.
>>
>> SECCOMP_IOCTL_NOTIF_SEND
>> This operation is used to send a notification response back
>> to the kernel. The third ioctl(2) argument of this
>> structure is a pointer to a structure of the following
>> form:
>>
>> struct seccomp_notif_resp {
>> __u64 id; /* Cookie value */
>> __s64 val; /* Success return value */
>> __s32 error; /* 0 (success) or negative
>> error number */
>> __u32 flags; /* See below */
>> };
>>
>> The fields of this structure are as follows:
>>
>> id This is the cookie value that was obtained using the
>> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie
>> value allows the kernel to correctly associate this
>> response with the system call that triggered the
>> user-space notification.
>>
>> val This is the value that will be used for a spoofed
>> success return for the target's system call; see
>> below.
>>
>> error This is the value that will be used as the error
>> number (errno) for a spoofed error return for the
>> target's system call; see below.
>>
>> flags This is a bit mask that includes zero or more of the
>> following flags:
>>
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>> Tell the kernel to execute the target's
>> system call.
>>
>> Two kinds of response are possible:
>>
>> • A response to the kernel telling it to execute the
>> target's system call. In this case, the flags field
>> includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
>> and val fields must be zero.
>>
>> This kind of response can be useful in cases where the
>> supervisor needs to do deeper analysis of the target's
>> system call than is possible from a seccomp filter (e.g.,
>> examining the values of pointer arguments), and, having
>> decided that the system call does not require emulation
>> by the supervisor, the supervisor wants the system call
>> to be executed normally in the target.
>>
>> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
>> with caution; see NOTES.
>>
>> • A spoofed return value for the target's system call. In
>> this case, the kernel does not execute the target's
>> system call, instead causing the system call to return a
>> spoofed value as specified by fields of the
>> seccomp_notif_resp structure. The supervisor should set
>> the fields of this structure as follows:
>>
>> + flags does not contain
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>
>> + error is set either to 0 for a spoofed "success"
>> return or to a negative error number for a spoofed
>> "failure" return. In the former case, the kernel
>> causes the target's system call to return the value
>> specified in the val field. In the later case, the
>> kernel causes the target's system call to return -1,
>> and errno is assigned the negated error value.
>>
>> + val is set to a value that will be used as the return
>> value for a spoofed "success" return for the target's
>> system call. The value in this field is ignored if
>> the error field contains a nonzero value.
>
> Strictly speaking, this is architecture specific, but all architectures
> do it this way. Should seccomp enforce val == 0 when err != 0 ?

That seems a resonable check to add. Initially, I found the absence of
such a check confusing, since it left me wondering: have I understood
the kernel code correctly?

>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINPROGRESS
>> A response to this notification has already been
>> sent.
>>
>> EINVAL An invalid value was specified in the flags field.
>>
>> EINVAL The flags field contained
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or
>> val field was not zero.
>>
>> ENOENT The blocked system call in the target has been
>> interrupted by a signal handler or the target has
>> terminated.
>>
>> NOTES
>> select()/poll()/epoll semantics
>> The file descriptor returned when seccomp(2) is employed with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>> poll(2), epoll(7), and select(2). These interfaces indicate that
>> the file descriptor is ready as follows:
>>
>> • When a notification is pending, these interfaces indicate that
>> the file descriptor is readable. Following such an indication,
>> a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
>> returning either information about a notification or else
>> failing with the error EINTR if the target has been killed by a
>> signal or its system call has been interrupted by a signal
>> handler.
>>
>> • After the notification has been received (i.e., by the
>> SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
>> indicate that the file descriptor is writable, meaning that a
>> notification response can be sent using the
>> SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
>>
>> • After the last thread using the filter has terminated and been
>> reaped using waitpid(2) (or similar), the file descriptor
>> indicates an end-of-file condition (readable in select(2);
>> POLLHUP/EPOLLHUP in poll(2)/ epoll_wait(2)).
>
> I'll reply separately about the "ioctl() does not terminate when all
> filters have terminated" case.

Okay.

>> Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
>> The intent of the user-space notification feature is to allow
>> system calls to be performed on behalf of the target. The
>> target's system call should either be handled by the supervisor or
>> allowed to continue normally in the kernel (where standard
>> security policies will be applied).
>>
>> Note well: this mechanism must not be used to make security policy
>> decisions about the system call, which would be inherently race-
>> prone for reasons described next.
>>
>> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
>> caution. If set by the supervisor, the target's system call will
>> continue. However, there is a time-of-check, time-of-use race
>> here, since an attacker could exploit the interval of time where
>> the target is blocked waiting on the "continue" response to do
>> things such as rewriting the system call arguments.
>>
>> Note furthermore that a user-space notifier can be bypassed if the
>> existing filters allow the use of seccomp(2) or prctl(2) to
>> install a filter that returns an action value with a higher
>> precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>>
>> It should thus be absolutely clear that the seccomp user-space
>> notification mechanism can not be used to implement a security
>> policy! It should only ever be used in scenarios where a more
>> privileged process supervises the system calls of a lesser
>> privileged target to get around kernel-enforced security
>> restrictions when the supervisor deems this safe. In other words,
>> in order to continue a system call, the supervisor should be sure
>> that another security mechanism or the kernel itself will
>> sufficiently block the system call if its arguments are rewritten
>> to something unsafe.
>>
>> Interaction with SA_RESTART signal handlers
>> Consider the following scenario:
>>
>> • The target process has used sigaction(2) to install a signal
>> handler with the SA_RESTART flag.
>>
>> • The target has made a system call that triggered a seccomp user-
>> space notification and the target is currently blocked until the
>> supervisor sends a notification response.
>>
>> • A signal is delivered to the target and the signal handler is
>> executed.
>>
>> • When (if) the supervisor attempts to send a notification
>> response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)) operation will
>> fail with the ENOENT error.
>>
>> In this scenario, the kernel will restart the target's system
>> call. Consequently, the supervisor will receive another user-
>> space notification. Thus, depending on how many times the blocked
>> system call is interrupted by a signal handler, the supervisor may
>> receive multiple notifications for the same system call in the
>
> maybe "... for the same instance of a system call in the target." for
> clarity?

Yes, that's a nice clarification.

>> target.
>>
>> One oddity is that system call restarting as described in this
>> scenario will occur even for the blocking system calls listed in
>> signal(7) that would never normally be restarted by the SA_RESTART
>> flag.
>
> Does this need fixing? I imagine the correct behavior for this case
> would be a response to _SEND of EINPROGRESS and the target would see
> EINTR normally?

That sounds reasonable.

> I mean, it's not like seccomp doesn't already expose weirdness with
> syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
> regard. :(

I've added the above comments as a FIXME in the page.

>> BUGS
>> If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed
>> after the target terminates, then the ioctl(2) call simply blocks
>> (rather than returning an error to indicate that the target no
>> longer exists).
>
> I want this fixed. It caused me no end of pain when building the
> selftests, and ended up spawning my implementing a global test timeout
> in kselftest. :P Before the usage counter refactor, there was no sane
> way to deal with this, but now I think we're close[2]. I'll reply
> separately about this.

Also added as FIXME comment in the page :-).

The behavior here is surprising, and caused me some
confusion until I worked out what was going on.

>> EXAMPLES
>> The (somewhat contrived) program shown below demonstrates the use
>> of the interfaces described in this page. The program creates a
>> child process that serves as the "target" process. The child
>> process installs a seccomp filter that returns the
>> SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
>> The child process then calls mkdir(2) once for each of the
>> supplied command-line arguments, and reports the result returned
>> by the call. After processing all arguments, the child process
>> terminates.
>>
>> The parent process acts as the supervisor, listening for the
>> notifications that are generated when the target process calls
>> mkdir(2). When such a notification occurs, the supervisor
>> examines the memory of the target process (using /proc/[pid]/mem)
>> to discover the pathname argument that was supplied to the
>> mkdir(2) call, and performs one of the following actions:
>
> I like this example! It's simple enough to be understandable and complex
> enough to show the purpose of user_notif. :)

Precisely my aim. Thank you for noticing and appreciating :-).

>> • If the pathname begins with the prefix "/tmp/", then the
>> supervisor attempts to create the specified directory, and then
>> spoofs a return for the target process based on the return value
>> of the supervisor's mkdir(2) call. In the event that that call
>> succeeds, the spoofed success return value is the length of the
>> pathname.
>>
>> • If the pathname begins with "./" (i.e., it is a relative
>> pathname), the supervisor sends a
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say
>> that the kernel should execute the target process's mkdir(2)
>> call.
>>
>> • If the pathname begins with some other prefix, the supervisor
>> spoofs an error return for the target process, so that the
>> target process's mkdir(2) call appears to fail with the error
>> EOPNOTSUPP ("Operation not supported"). Additionally, if the
>> specified pathname is exactly "/bye", then the supervisor
>> terminates.

[...]

>> Program source
>> #define _GNU_SOURCE
>> #include <sys/types.h>
>> #include <sys/prctl.h>
>> #include <fcntl.h>
>> #include <limits.h>
>> #include <signal.h>
>> #include <stddef.h>
>> #include <stdint.h>
>> #include <stdbool.h>
>> #include <linux/audit.h>
>> #include <sys/syscall.h>
>> #include <sys/stat.h>
>> #include <linux/filter.h>
>> #include <linux/seccomp.h>
>> #include <sys/ioctl.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <errno.h>
>> #include <sys/socket.h>
>> #include <sys/un.h>
>>
>> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
>> } while (0)
>
> Because I love macros, you can expand this to make it take a format
> string:
>
> #define errExit(fmt, ...) do { \
> char __err[64]; \
> strerror_r(errno, __err, sizeof(__err)); \
> fprintf(stderr, fmt ": %s\n", ##__VA_ARG__, __err); \
> exit(EXIT_FAILURE); \
> } while (0)

I'm a bit divivided about this. I don't want to distract the reader by
requiring them to understand the macro. I'll leave this for the moment.

[...]

>> static void
>> sigchldHandler(int sig)
>> {
>> char *msg = "\tS: target has terminated; bye\n";
>>
>> write(STDOUT_FILENO, msg, strlen(msg));
>
> white space nit: extra space before "="

Thanks!

> efficiency nit: strlen isn't needed, since it can be done with
> compile-time constant constants:
>
> char msg[] = "\tS: target has terminated; bye\n";
> write(STDOUT_FILENO, msg, sizeof(msg) - 1);
>
> (some optimization levels may already replace the strlen a sizeof - 1)

Changed as you suggest. Thanks!

>> _exit(EXIT_SUCCESS);
>> }

[...]

>> static void
>> checkNotificationIdIsValid(int notifyFd, uint64_t id)
>> {
>> if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
>> fprintf(stderr, "\tS: notification ID check: "
>> "target has terminated!!!\n");
>>
>> exit(EXIT_FAILURE);
>
> And now you can do:
>
> errExit("\tS: notification ID check: "
> "target has terminated! ioctl");
>
> ;)

Thanks. Changed as you suggest.

>> }
>> }
>>
>> /* Access the memory of the target process in order to discover the
>> pathname that was given to mkdir() */
>>
>> static bool
>> getTargetPathname(struct seccomp_notif *req, int notifyFd,
>> char *path, size_t len)
>> {
>> char procMemPath[PATH_MAX];
>>
>> snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>> int procMemFd = open(procMemPath, O_RDONLY);
>> if (procMemFd == -1)
>> errExit("Supervisor: open");
>>
>> /* Check that the process whose info we are accessing is still alive.
>> If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>> in checkNotificationIdIsValid()) succeeds, we know that the
>> /proc/PID/mem file descriptor that we opened corresponds to the
>> process for which we received a notification. If that process
>> subsequently terminates, then read() on that file descriptor
>> will return 0 (EOF). */
>>
>> checkNotificationIdIsValid(notifyFd, req->id);
>>
>> /* Read bytes at the location containing the pathname argument
>> (i.e., the first argument) of the mkdir(2) call */
>>
>> ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>> if (nread == -1)
>> errExit("pread");
>>
>> if (nread == 0) {
>> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>> }
>>
>> if (close(procMemFd) == -1)
>> errExit("close-/proc/PID/mem");
>>
>> /* We have no guarantees about what was in the memory of the target
>> process. We therefore treat the buffer returned by pread() as
>> untrusted input. The buffer should be terminated by a null byte;
>> if not, then we will trigger an error for the target process. */
>>
>> for (int j = 0; j < nread; j++)
>> if (path[j] == ' ')
>
> This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man
> source shows:
>
> if (path[j] == \(aq\0\(aq)
>
> I think this needs to be \\0 ?

Yes, that was the intention.

> Or it could also be a tested as:
>
> if (strnlen(path, nread) < nread)
>

Good point. Changed to:

if (strnlen(path, nread) < nread)
return true;

[...]

>
> Thank you so much for this documentation and example! :)

You're welcome. It's been "interesting" uncovering the glitches :-).

Cheers,

Michael


> [1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555
> [2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@xxxxxxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@xxxxxxxxxxxxxx/
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/