Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent

From: Jann Horn
Date: Mon Sep 21 2020 - 13:47:36 EST


On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@xxxxxxxxx> wrote:
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> access. This is implemented here with a pseudo-emulator, and
> stored in a per-filter bitmap. Each seccomp cBPF instruction,
> aside from ALU (which should rarely be used in seccomp), gets a
> naive best-effort emulation for each syscall number.
>
> The emulator works by following all possible (without SAT solving)
> paths the filter can take. Every cBPF register / memory position
> records whether that is a constant, and of so, the value of the
> constant. Loading from struct seccomp_data is considered constant
> if it is a syscall number, else it is an unknown. For each
> conditional jump, if the both arguments can be resolved to a
> constant, the jump is followed after computing the result of the
> condition; else both directions are followed, by pushing one of
> the next states to a linked list of next states to process. We
> keep a finite number of pending states to process.

Is this actually necessary, or can we just bail out on any branch that
we can't statically resolve?

struct seccomp_data only contains the syscall number (constant for a
given filter evaluation), the architecture number (also constant), the
instruction pointer (basically never used in seccomp filters), and the
syscall arguments. Any normal seccomp filter first branches on the
architecture, then branches on the syscall number, and then branches
on arguments if necessary.

This optimization could only be improved by the "follow both branches"
logic if a seccomp program branches on either the instruction pointer
or an argument *before* looking at the syscall number, and later comes
to the same conclusion on *both* sides of the check. It would have to
be something like:

if (instruction_pointer == 0xasdf1234) {
if (nr == mmap) return ACCEPT;
[...]
return KILL;
} else {
if (nr == mmap) return ACCEPT;
[...]
return KILL;
}

I've never seen anyone do something like this. And the proposed patch
would still bail out on such a filter because of the load from the
instruction_pointer field; I don't think it would even be possible to
reach a branch with an unknown condition with this patch. So I think
we should probably get rid of this extra logic for keeping track of
multiple execution states for now. That would make the code a lot
simpler.


Also: If it turns out that the time spent in seccomp_cache_prepare()
is measurable for large filters, a possible improvement would be to
keep track of the last syscall number for which the result would be
the same as for the current one, such that instead of evaluating the
filter for one instruction at a time, it would effectively be
evaluated for a range at a time. That should be pretty straightforward
to implement, I think.

> The emulation is halted if it reaches a return, or if it reaches a
> read from struct seccomp_data that reads an offset that is neither
> syscall number or architecture number. In the latter case, we mark
> the syscall number as not okay for seccomp to cache. If a filter
> depends on more filters, then if its dependee cannot process the
> syscall then the depender is also marked not to process the syscall.
>
> We also do a single pass on the entire filter instructions before
> performing emulation. If none of the filter instructions load from
> the troublesome offsets, then the filter is considered "trivial",
> and all syscalls are marked okay for seccomp to cache.
>
> Signed-off-by: YiFei Zhu <yifeifz2@xxxxxxxxxxxx>
> ---
> arch/x86/Kconfig | 27 ++++
> kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 349 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
[...]
> +choice
> + prompt "Seccomp filter cache"
> + default SECCOMP_CACHE_NONE

I think this should be on by default.

> + depends on SECCOMP
> + depends on SECCOMP_FILTER

SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
line is unnecessary.

> + help
> + Seccomp filters can potentially incur large overhead for each
> + system call. This can alleviate some of the overhead.
> +
> + If in doubt, select 'none'.

This should not be in arch/x86. Other architectures, such as arm64,
should also be able to use this without extra work.

> +config SECCOMP_CACHE_NONE
> + bool "None"
> + help
> + No caching is done. Seccomp filters will be called each time
> + a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> + bool "Syscall number only"
> + help
> + This is enables a bitmap to cache the results of seccomp
> + filters, if the filter allows the syscall and is independent
> + of the syscall arguments.

Maybe reword this as something like: "For each syscall number, if the
seccomp filter has a fixed result, store that result in a bitmap to
speed up system calls."

> This requires around 60 bytes per
> + filter and 70 bytes per task.
> +
> +endchoice
> +
> source "kernel/Kconfig.hz"
>
> config KEXEC
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..d8c30901face 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,27 @@ struct notification {
> struct list_head notifications;
> };
>
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to

nit: represents

> + * cache the results of this syscall.
> + */
> +struct seccomp_cache_filter_data {
> + DECLARE_BITMAP(syscall_ok, NR_syscalls);
> +};
> +
> +#define SECCOMP_EMU_MAX_PENDING_STATES 64
> +#else
> +struct seccomp_cache_filter_data { };
> +
> +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
[...]
> +/**
> + * seccomp_emu_step - step one instruction in the emulator
> + * @env: The emulator environment
> + * @state: The emulator state
> + *
> + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
> + */
> +static int seccomp_emu_step(struct seccomp_emu_env *env,
> + struct seccomp_emu_state *state)
> +{
> + struct sock_filter *ftest = &env->filter[state->pc++];
> + struct seccomp_emu_state *new_state;
> + u16 code = ftest->code;
> + u32 k = ftest->k;
> + u32 operand;
> + bool compare;
> + int reg_idx;
> +
> + switch (BPF_CLASS(code)) {
> + case BPF_LD:
> + case BPF_LDX:
> + reg_idx = BPF_CLASS(code) == BPF_LDX;
> +
> + switch (BPF_MODE(code)) {
> + case BPF_IMM:
> + state->reg_known[reg_idx] = true;
> + state->reg_const[reg_idx] = k;
> + break;
> + case BPF_ABS:
> + if (k == offsetof(struct seccomp_data, nr)) {
> + state->reg_known[reg_idx] = true;
> + state->reg_const[reg_idx] = env->nr;
> + } else {
> + state->reg_known[reg_idx] = false;

This is completely broken. This emulation logic *needs* to run with
the proper architecture identifier. (And for platforms like x86-64
that have compatibility support for a second ABI, the emulation should
probably also be done for that ABI, and there should be separate
bitmasks for that ABI.)

With the current logic, you will (almost) never actually have
permitted syscalls in the bitmask, because filters fundamentally have
to return different results for different ABIs - the syscall numbers
mean completely different things under different ABIs.

> + if (k != offsetof(struct seccomp_data, arch)) {
> + env->syscall_ok = false;
> + return 1;
> + }
> + }

This would read nicer as:

if (k == offsetof(struct seccomp_data, nr)) {

} else if (k == offsetof(struct seccomp_data, arch)) {

} else {
env->syscall_ok = false;
return 1;
}

> +
> + break;
> + case BPF_MEM:
> + state->reg_known[reg_idx] = state->reg_known[2 + k];
> + state->reg_const[reg_idx] = state->reg_const[2 + k];
> + break;
> + default:
> + state->reg_known[reg_idx] = false;
> + }
> +
> + return 0;
> + case BPF_ST:
> + case BPF_STX:
> + reg_idx = BPF_CLASS(code) == BPF_STX;
> +
> + state->reg_known[2 + k] = state->reg_known[reg_idx];
> + state->reg_const[2 + k] = state->reg_const[reg_idx];

I think we should probably just bail out if we see anything that's
BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that
part of cBPF.

But in case we do need this, maybe instead of using "2 +" for all
these things, the cBPF memory slots should be in a separate array.

> + return 0;
> + case BPF_ALU:
> + state->reg_known[0] = false;
> + return 0;
> + case BPF_JMP:
> + if (BPF_OP(code) == BPF_JA) {
> + state->pc += k;
> + return 0;
> + }
> +
> + if (ftest->jt == ftest->jf) {
> + state->pc += ftest->jt;
> + return 0;
> + }

Why is this check here? Is anyone actually creating filters with such
obviously nonsensical branches? I know that there are highly ludicrous
filters out there, but I don't think I've ever seen this specific kind
of useless code.

> + if (!state->reg_known[0])
> + goto both_cases;
[...]
> +both_cases:
> + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES)
> + return -E2BIG;

Even if we cap the maximum number of pending states, this could still
run for an almost unbounded amount of time, I think. Which is bad. If
this code was actually necessary, we'd probably want to track
separately the total number of branches we've seen and so on.

But as I said, I think this code should just be removed instead.

[...]
> + }
> +}
[...]