Re: For review: seccomp_user_notif(2) manual page

From: Michael Kerrisk (man-pages)
Date: Wed Sep 30 2020 - 16:35:27 EST


Hi Tycho,

Thanks for taking the time to look at the page!

On 9/30/20 5:03 PM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> 2. In order that the supervisor process can obtain notifications
>> using the listening file descriptor, (a duplicate of) that
>> file descriptor must be passed from the target process to the
>> supervisor process. One way in which this could be done is by
>> passing the file descriptor over a UNIX domain socket connec‐
>> tion between the two processes (using the SCM_RIGHTS ancillary
>> message type described in unix(7)). Another possibility is
>> that the supervisor might inherit the file descriptor via
>> fork(2).
>
> It is technically possible to inherit the fd via fork, but is it
> really that useful? The child process wouldn't be able to actually do
> the syscall in question, since it would have the same filter.

D'oh! Yes, of course.

I think I was reaching because in an earlier conversation
you replied:

[[
> 3. The "target process" passes the "listening file descriptor"
> to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.
]]

So, what other means?

Anyway, I removed the sentence mentioning fork().
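
For concreteness, here is a minimal sketch of the SCM_RIGHTS approach
that the text describes. The helper names send_fd() and recv_fd() are
made up for illustration and are not taken from the page; the sketch
assumes a connected UNIX domain stream socket between the two
processes, and error handling is reduced to a bare minimum.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send file descriptor 'fd' over the connected
   UNIX domain socket 'sock'. Returns 0 on success, -1 on error. */
static int
send_fd(int sock, int fd)
{
    char dummy = 'x';   /* One byte of ordinary data carries the message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* Forces suitable alignment */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    /* The descriptor travels as SCM_RIGHTS ancillary data */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Hypothetical helper: receive one file descriptor from 'sock'.
   Returns the new descriptor, or -1 on error. */
static int
recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    /* Accept only a single SCM_RIGHTS message holding exactly one fd */
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the scenario on the page, the target would call send_fd() with the
listening file descriptor and the supervisor would call recv_fd() on
the other end of the socket.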

>> The information in the notification can be used to discover
>> the values of pointer arguments for the target process's sys‐
>> tem call. (This is something that can't be done from within a
>> seccomp filter.) To do this (and assuming it has suitable
>
> s/To do this/One way to accomplish this/ perhaps, since there are
> others.

Yes, thanks, done.

>> permissions), the supervisor opens the corresponding
>> /proc/[pid]/mem file, seeks to the memory location that corre‐
>> sponds to one of the pointer arguments whose value is supplied
>> in the notification event, and reads bytes from that location.
>> (The supervisor must be careful to avoid a race condition that
>> can occur when doing this; see the description of the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
>> tion, the supervisor can access other system information that
>> is visible in user space but which is not accessible from a
>> seccomp filter.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Suppose we are reading a pathname from /proc/PID/mem │
>> │for a system call such as mkdir(). The pathname can │
>> │be an arbitrary length. How do we know how much (how │
>> │many pages) to read from /proc/PID/mem? │
>> └─────────────────────────────────────────────────────┘
>
> PATH_MAX, I suppose.

Yes, I misunderstood a fundamental detail here, as Jann
also confirmed.
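
To spell out the PATH_MAX idea, a rough sketch follows. The helper
names open_proc_mem() and read_pathname() are hypothetical; 'addr'
stands for the pointer argument taken from the notification. The
sketch assumes that a read which runs into an unmapped page comes
back short rather than failing outright, so a short read is accepted
as long as the terminating null byte was captured.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/PID/mem read-only.
   Returns a file descriptor, or -1 on error. */
static int
open_proc_mem(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: copy the null-terminated pathname located at
   address 'addr' in the target's memory into 'buf', which must have
   room for PATH_MAX bytes. Returns the length of the pathname
   (excluding the terminator), or -1 on error. */
static ssize_t
read_pathname(int mem_fd, uint64_t addr, char *buf)
{
    ssize_t n;
    char *nul;

    assert(buf != NULL);

    /* Ask for at most PATH_MAX bytes; the kernel may return fewer */
    n = pread(mem_fd, buf, PATH_MAX, (off_t) addr);
    if (n == 0)
        errno = EIO;            /* Nothing readable at that address */
    if (n <= 0)
        return -1;

    /* Without a terminator in what was read, there is no valid
       pathname to hand back */
    nul = memchr(buf, '\0', (size_t) n);
    if (nul == NULL) {
        errno = ENAMETOOLONG;
        return -1;
    }

    return nul - buf;
}
```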

>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │From my experiments, it appears that if a SEC‐ │
>> │COMP_IOCTL_NOTIF_RECV is done after the target │
>> │process terminates, then the ioctl() simply blocks │
>> │(rather than returning an error to indicate that the │
>> │target process no longer exists). │
>
> Yeah, I think Christian wanted to fix this at some point,

Do you have a pointer to that discussion? I could not find it with a
quick search.

> but it's a
> bit sticky to do.

Can you say a few words about the nature of the problem?

In the meantime, I think this merits a note under BUGS, and
I've added one.
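
Until that is resolved, one defensive pattern a supervisor could use
is to poll(2) the listening file descriptor with a timeout before
issuing SECCOMP_IOCTL_NOTIF_RECV, so that it gets a chance to check
whether the target is still around. The sketch below is only an
illustration: the name wait_for_notification() is made up, and it
assumes that the descriptor indicates as readable when a notification
is pending.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds for
   'notify_fd' to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
static int
wait_for_notification(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n;

    /* A negative timeout would block forever, defeating the purpose */
    assert(timeout_ms >= 0);

    /* Restart if interrupted by a signal handler (for simplicity,
       the full timeout is applied again on each restart) */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n <= 0)
        return n;               /* 0: timed out; -1: poll() failed */

    /* Treat anything other than "readable" (e.g., POLLERR) as an error */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

On a return of 0, the supervisor can check by whatever means it has
whether the target still exists, and only call
SECCOMP_IOCTL_NOTIF_RECV once the function returns 1.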

> Note that if you e.g. rely on fork() above, the
> filter is shared with your current process, and this notification
> would never be possible. Perhaps another reason to omit that from the
> man page.

(Yes, as noted above, I removed that sentence.)

>> SECCOMP_IOCTL_NOTIF_ID_VALID
>> This operation can be used to check that a notification ID
>> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>> is still valid (i.e., that the target process still
>> exists).
>>
>> The third ioctl(2) argument is a pointer to the cookie
>> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>> This operation is necessary to avoid race conditions that
>> can occur when the pid returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation terminates, and that
>> process ID is reused by another process. An example of
>> this kind of race is the following:
>>
>> 1. A notification is generated on the listening file
>> descriptor. The returned seccomp_notif contains the
>> PID of the target process.
>>
>> 2. The target process terminates.
>>
>> 3. Another process is created on the system that by chance
>> reuses the PID that was freed when the target process
>> terminates.
>>
>> 4. The supervisor open(2)s the /proc/[pid]/mem file for
>> the PID obtained in step 1, with the intention of (say)
>> inspecting the memory locations that contain the argu‐
>> ments of the system call that triggered the notifica‐
>> tion in step 1.
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the tar‐
>> get. This race can be avoided by following the call to
>> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>> ify that the process that generated the notification is
>> still alive. (Note that if the target process subse‐
>> quently terminates, its PID won't be reused because there
>> remains an open reference to the /proc/[pid]/mem file; in
>> this case, a subsequent read(2) from the file will return
>> 0, indicating end of file.)
>>
>> On success (i.e., the notification ID is still valid),
>> this operation returns 0 On failure (i.e., the notifica‐
> ^ need a period?
>
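
The ordering described in the quoted text (open first, then check the
ID) can be sketched as follows. The function name open_target_mem()
is hypothetical, 'notify_fd' is the listening file descriptor, and
'req' is assumed to have been filled in by an earlier
SECCOMP_IOCTL_NOTIF_RECV operation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: open /proc/PID/mem for the target named in
   'req', then confirm that the notification is still valid.
   Returns a file descriptor for the target's memory, or -1 if the
   open failed or the notification ID is no longer valid. */
static int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int mem_fd;

    assert(req != NULL);

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) req->pid);
    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The check must come AFTER the open: if the ID is still valid
       at this point, the file we opened belongs to the target and
       not to some other process that reused the PID */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) == -1) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```

Doing the check before the open would leave a window in which the
target could terminate and the PID could be reused.
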
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Interestingly, after the event had been received, │
>> │the file descriptor indicates as writable (verified │
>> │from the source code and by experiment). How is this │
>> │useful? │
>
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

No, I'm saying something more fundamental: why is the FD indicating as
writable? Can you write something to it? If yes, what? If not, then
why do these APIs want to say that the FD is writable?

>> EXAMPLES
>> The (somewhat contrived) program shown below demonstrates the use
>
> May also be worth mentioning the example in
> samples/seccomp/user-trap.c as well.

Oh -- I meant to do that! Thanks for reminding me.

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/