Re: [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

From: Jann Horn
Date: Wed Sep 23 2020 - 20:42:31 EST


On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> For systems that provide multiple syscall maps based on audit
> architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
> CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
> to pin the process to a specific syscall table, instead of needing
> to generate all filters with an architecture check as the first filter
> action.
>
> This creates the internal representation that seccomp itself can use
> (which is separate from the filters, which need to stay runtime
> agnostic). Additionally paves the way for constant-action bitmaps.

I don't really see the point in providing this UAPI - the syscall
number checking will probably have much more performance cost than the
architecture number check, and it's not like this lets us avoid the
check, we're just moving it over into C code.

> Signed-off-by: Kees Cook <keescook@xxxxxxxxxxxx>
> ---
> include/linux/seccomp.h | 9 +++
> include/uapi/linux/seccomp.h | 1 +
> kernel/seccomp.c | 79 ++++++++++++++++++-
> tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
> 4 files changed, 120 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 02aef2844c38..0be20bc81ea9 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -20,12 +20,18 @@
> #include <linux/atomic.h>
> #include <asm/seccomp.h>
>
> +#define SECCOMP_ARCH_IS_NATIVE 1
> +#define SECCOMP_ARCH_IS_COMPAT 2

FYI, mips has three different possible "arch" values (per kernel build
config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed
based on the config):

- AUDIT_ARCH_MIPS
- AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT
- AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32

But I guess we can deal with that once someone wants to actually add
support for this on mips.

> +#define SECCOMP_ARCH_IS_MULTIPLEX 3

Why should X32 be handled specially? If the seccomp filter allows
specific syscalls (as it should), we don't have to care about X32.
Only in weird cases where the seccomp filter wants to deny specific
syscalls (a horrible idea), X32 is a concern, and in such cases, the
userspace code can generate a single conditional jump to deal with it.

And when seccomp is used properly to allow specific syscalls, the
kernel will just waste time uselessly checking this X32 stuff.

[...]
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
[...]
> +static long seccomp_pin_architecture(void)
> +{
> +#ifdef SECCOMP_ARCH
> + struct task_struct *task = current;
> +
> + u8 arch = seccomp_get_arch(syscall_get_arch(task),
> + syscall_get_nr(task, task_pt_regs(task)));
> +
> + /* How did you even get here? */

Via a racing TSYNC, that's how.

> + if (task->seccomp.arch && task->seccomp.arch != arch)
> + return -EBUSY;
> +
> + task->seccomp.arch = arch;
> +#endif
> + return 0;
> +}

Why does this return 0 if SECCOMP_ARCH is not defined? That suggests
to userspace that we have successfully pinned the ABI, even though
we're actually unable to do so.