Re: [PATCH v10 11/11] Documentation: prctl/seccomp_filter

From: Will Drewry
Date: Tue Feb 21 2012 - 22:41:53 EST


On Tue, Feb 21, 2012 at 3:12 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> Hi,
>
> I've collected the initial no-new-privs patches, and this whole series
> and pushed it here so I could more easily review it:
> http://git.kernel.org/?p=linux/kernel/git/kees/linux.git;a=shortlog;h=refs/heads/seccomp
>
> Some minor tweaks below...
>
> On Tue, Feb 21, 2012 at 11:30:35AM -0600, Will Drewry wrote:
>> Documents how system call filtering using Berkeley Packet
>> Filter programs works and how it may be used.
>> Includes an example for x86 (32-bit) and a semi-generic
>> example using a macro-based code generator.
>>
>> v10: - update for SIGSYS
>>      - update for new seccomp_data layout
>>      - update for ptrace option use
>> v9: - updated bpf-direct.c for SIGILL
>> v8: - add PR_SET_NO_NEW_PRIVS to the samples.
>> v7: - updated for all the new stuff in v7: TRAP, TRACE
>>     - only talk about PR_SET_SECCOMP now
>>     - fixed bad JLE32 check (coreyb@xxxxxxxxxxxxxxxxxx)
>>     - adds dropper.c: a simple system call disabler
>> v6: - tweak the language to note the requirement of
>>       PR_SET_NO_NEW_PRIVS being called prior to use. (luto@xxxxxxx)
>> v5: - update sample to use system call arguments
>>     - adds a "fancy" example using a macro-based generator
>>     - cleaned up bpf in the sample
>>     - update docs to mention arguments
>>     - fix prctl value (eparis@xxxxxxxxxx)
>>     - language cleanup (rdunlap@xxxxxxxxxxxx)
>> v4: - update for no_new_privs use
>>     - minor tweaks
>> v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xxxxxxxxxxxx)
>>     - document use of tentative always-unprivileged
>>     - guard sample compilation for i386 and x86_64
>> v2: - move code to samples (corbet@xxxxxxx)
>>
>> Signed-off-by: Will Drewry <wad@xxxxxxxxxxxx>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |  157 +++++++++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   31 ++++
>>  samples/seccomp/bpf-direct.c           |  150 ++++++++++++++++++++
>>  samples/seccomp/bpf-fancy.c            |  102 ++++++++++++++
>>  samples/seccomp/bpf-helper.c           |   89 ++++++++++++
>>  samples/seccomp/bpf-helper.h           |  236 ++++++++++++++++++++++++++++++++
>>  samples/seccomp/dropper.c              |   68 +++++++++
>>  8 files changed, 834 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-direct.c
>>  create mode 100644 samples/seccomp/bpf-fancy.c
>>  create mode 100644 samples/seccomp/bpf-helper.c
>>  create mode 100644 samples/seccomp/bpf-helper.h
>>  create mode 100644 samples/seccomp/dropper.c
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..7de865b
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,157 @@
>> +             SECure COMPuting with filters
>> +             =============================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the process.
>> +As system calls change and mature, bugs are found and eradicated.  A
>> +certain subset of userland applications benefit by having a reduced set
>> +of available system calls.  The resulting set reduces the total kernel
>> +surface exposed to the application.  System call filtering is meant for
>> +use with those applications.
>> +
>> +Seccomp filtering provides a means for a process to specify a filter for
>> +incoming system calls.  The filter is expressed as a Berkeley Packet
>> +Filter (BPF) program, as with socket filters, except that the data
>> +operated on is related to the system call being made: system call
>> +number and the system call arguments.  This allows for expressive
>> +filtering of system calls using a filter program language with a long
>> +history of being exposed to userland and a straightforward data set.
>> +
>> +Additionally, BPF makes it impossible for users of seccomp to fall prey
>> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
>> +call interposition frameworks.  BPF programs may not dereference
>> +pointers, which constrains all filters to solely evaluating the system
>> +call arguments directly.
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox.  It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface.  It is meant to be
>> +a tool for sandbox developers to use.  Beyond that, policy for logical
>> +behavior and information flow should be managed with a combination of
>> +other system hardening techniques and, potentially, an LSM of your
>> +choosing.  Expressive, dynamic filters provide further options down this
>> +path (avoiding pathological sizes or selecting which of the multiplexed
>> +system calls in socketcall() is allowed, for instance) which could be
>> +construed, incorrectly, as a more complete sandboxing solution.
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is added and is enabled using the same
>> +prctl(2) call as the strict seccomp.  If the architecture has
>> +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
>> +
>> +PR_SET_SECCOMP:
>> +     Now takes an additional argument which specifies a new filter
>> +     using a BPF program.
>> +     The BPF program will be executed over struct seccomp_data
>> +     reflecting the system call number, arguments, and other
>> +     metadata.  The BPF program must then return one of the
>> +     acceptable values to inform the kernel which action should be
>> +     taken.
>> +
>> +     Usage:
>> +             prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
>> +
>> +     The 'prog' argument is a pointer to a struct sock_fprog which
>> +     will contain the filter program.  If the program is invalid, the
>> +     call will return -1 and set errno to EINVAL.
>> +
>> +     Note, is_compat_task is also tracked for the @prog.  This means
>> +     that once set the calling task will have all of its system calls
>> +     blocked if it switches its system call ABI.
>> +
>> +     If fork/clone and execve are allowed by @prog, any child
>> +     processes will be constrained to the same filters and system
>> +     call ABI as the parent.
>> +
>> +     Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
>> +     run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
>> +     true, -EACCES will be returned.  This requirement ensures that filter
>> +     programs cannot be applied to child processes with greater privileges
>> +     than the task that installed them.
>> +
>> +     Additionally, if prctl(2) is allowed by the attached filter,
>> +     additional filters may be layered on which will increase evaluation
>> +     time, but allow for further decreasing the attack surface during
>> +     execution of a process.
>> +
>> +The above call returns 0 on success and non-zero on error.
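
For readers following along, here is a minimal sketch of the usage
described above. It is not taken from the samples in this patch: the
names example_filter, example_prog and install_example_filter are made
up for illustration, and the policy (fail getppid() with EPERM, allow
everything else) is arbitrary. A real filter should also validate
seccomp_data.arch before trusting the syscall number.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical policy: getppid() fails with EPERM, the rest is allowed. */
static struct sock_filter example_filter[] = {
	/* A = seccomp_data.nr (the system call number) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* if (A == __NR_getppid) run the next insn, else skip over it */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
	/* errno travels in the lower 16 bits of the return value */
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog example_prog = {
	.len = sizeof(example_filter) / sizeof(example_filter[0]),
	.filter = example_filter,
};

/* Returns 0 on success, -1 with errno set on failure. */
int install_example_filter(void)
{
	/* Required unless the caller has CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &example_prog);
}
```

After install_example_filter() succeeds, getppid() in the calling task
(and in any children) should return -1 with errno set to EPERM.
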
>> +
>> +Return values
>> +-------------
>> +
>> +A seccomp filter may return any of the following values:
>> +     SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
>> +     SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
>> +
>> +SECCOMP_RET_ALLOW:
>> +     If all filters for a given task return this value then
>> +     the system call will proceed normally.
>> +
>> +SECCOMP_RET_KILL:
>> +     If any filters for a given task return this value then
>> +     the task will exit immediately without executing the system
>> +     call.
>> +
>> +SECCOMP_RET_TRAP:
>> +     If any filters specify SECCOMP_RET_TRAP and none of them
>> +     specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
>> +     signal to the task and not execute the system call.  The kernel
>> +     will roll back the register state to just before system call
>> +     entry such that a signal handler in the process will be able
>> +     to inspect the ucontext_t->uc_mcontext registers and emulate
>> +     system call success or failure upon return from the signal
>> +     handler.
>> +
>> +     The SIGTRAP is differentiated from other SIGTRAPs by a si_code
>> +     of TRAP_SECCOMP.
>
> This should reflect the SIGTRAP->SIGSYS change (and SYS_SECCOMP si_code
> change).

Oops - yup.

>> +
>> +SECCOMP_RET_ERRNO:
>> +     If returned, the value provided in the lower 16 bits is
>> +     returned to userland as the errno and the system call is
>> +     not executed.
>
> The other sections each say "If any" or "If all" to clarify their
> behavior with multiple filters. The same should be done here, but more
> comments below. Additionally, it should clarify that on multiple
> uses of RET_ERRNO, the lower of the errnos will be returned.

I might drop all of the written-out precedence verbiage, since I think
your layout is more intuitive without it.
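
For what it's worth, the rule as worded in this thread boils down to
"the numerically smallest return value wins". Below is a small
userspace sketch of that rule, not the kernel code itself;
combine_filter_results is a made-up name, and it leans on the
SECCOMP_RET_* values in <linux/seccomp.h> being ordered KILL < TRAP <
ERRNO < TRACE < ALLOW.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Combine the return values of all filters attached to a task.
 * Taking the numeric minimum picks the highest-precedence action and,
 * among several SECCOMP_RET_ERRNO results, the lowest errno.
 * With no filters at all the call is simply allowed.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t count)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	for (i = 0; i < count; i++)
		if (results[i] < ret)
			ret = results[i];
	return ret;
}
```

So three filters returning ALLOW, ERRNO|13 and ERRNO|1 would yield
ERRNO|1, and adding a fourth that returns TRAP would yield TRAP.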

>> +
>> +SECCOMP_RET_TRACE:
>> +     If any filters return this value and the others return
>> +     SECCOMP_RET_ALLOW, then the kernel will attempt to notify
>> +     a ptrace()-based tracer prior to executing the system call.
>> +
>> +     A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>> +     via PTRACE_SETOPTIONS.  Otherwise, the system call will
>> +     not execute and -ENOSYS will be returned to userspace.
>> +
>> +     If the tracer ignores notification, then the system call will
>> +     proceed normally.  Changes to the registers will function
>> +     similarly to PTRACE_SYSCALL.  Additionally, if the tracer
>> +     detaches during notification or just after, the task may be
>> +     terminated as a precautionary measure.
>> +
>> +Please note that the order of precedence is as follows:
>> +SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
>> +SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
>> +
>> +If multiple filters exist, the return value for the evaluation of a given
>> +system call will always use the highest precedent value.
>> +SECCOMP_RET_KILL will always take precedence.
>
> I think this clarification about precedence is good but should be at the
> head of the "Return values" section, and the sections ordered from that
> perspective, so that the "highest precedent value" aspect is a little
> bit easier to follow:
>
>
> Return values
> -------------
> A seccomp filter may return any of the following values. If multiple
> filters exist, the return value for the evaluation of a given system
> call will always use the highest precedent value. (For example,
> SECCOMP_RET_KILL will always take precedence.)
>
> In precedence order, they are:
>
> SECCOMP_RET_KILL:
>        If any filters for a given task return this value then
>        the task will exit immediately without executing the system
>        call.
>
> SECCOMP_RET_TRAP:
>        If any filters specify SECCOMP_RET_TRAP and none of them
>        specify SECCOMP_RET_KILL, then the kernel will send a SIGSYS
>        signal to the task and not execute the system call. The kernel
>        will roll back the register state to just before system call
>        entry such that a signal handler in the process will be able
>        to inspect the ucontext_t->uc_mcontext registers and emulate
>        system call success or failure upon return from the signal
>        handler.
>
>        The SIGSYS is differentiated from other SIGSYS signals by a si_code
>        of SYS_SECCOMP.
>
> SECCOMP_RET_ERRNO:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the lowest of the values provided
>        in the lower 16 bits is returned to userland as the errno and
>        the system call is not executed.
>
> SECCOMP_RET_TRACE:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the kernel will attempt to notify
>        a ptrace()-based tracer prior to executing the system call.
>
>        A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>        via PTRACE_SETOPTIONS. Otherwise, the system call will
>        not execute and -ENOSYS will be returned to userspace.
>        If the tracer ignores notification, then the system call will
>        proceed normally. Changes to the registers will function
>        similarly to PTRACE_SYSCALL. Additionally, if the tracer
>        detaches during notification or just after, the task may be
>        terminated as a precautionary measure.
>
> SECCOMP_RET_ALLOW:
>        If all filters for a given task return this value then
>        the system call will proceed normally.
>
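On the SECCOMP_RET_TRACE part, a rough sketch of the tracer side, in
case it helps review. The function names are invented here.
PTRACE_O_TRACESECCOMP is the option named in the text above;
PTRACE_EVENT_SECCOMP is an assumption on my part, taken from current
libc headers (<sys/ptrace.h>) rather than from this mail, and is what a
tracer would look for in the waitpid() status.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Ask to be notified when a filter returns SECCOMP_RET_TRACE.
 * The tracee must already be attached and stopped.
 * Returns 0 on success, -1 with errno set on failure.
 */
long request_seccomp_events(pid_t tracee)
{
	return ptrace(PTRACE_SETOPTIONS, tracee, NULL,
		      (void *)(long)PTRACE_O_TRACESECCOMP);
}

/* Nonzero if a waitpid() status reports a seccomp event stop. */
int is_seccomp_event(int status)
{
	return WIFSTOPPED(status) &&
	       (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}
```

A tracer loop would call waitpid() on the tracee, test the status with
is_seccomp_event(), inspect or change registers, and then resume the
tracee with PTRACE_CONT.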

Thanks! I'll integrate all of this and post a full v11 series in the
morning (depending on any feedback trickling in later :).

cheers,
will