Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls

From: YiFei Zhu
Date: Mon Sep 21 2020 - 11:28:32 EST


On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> One problem with a kernel config setting is that it's for all tasks.
> While docker and systemd may make decisions based on syscall number,
> other applications may have more nuanced filters, and this cache would
> yield incorrect results.
>
> You could work around this by making this a filter flag instead;
> filter authors would generally know whether their filter results can
> be cached and probably be motivated to opt in if their users are
> complaining about slow syscall execution.
>
> Tycho

Yielding incorrect results should not be possible. The purpose of the
"emulator" (for lack of a better term) is to determine whether the
filter reads any syscall arguments. A read from a syscall argument
must go through the BPF_LD | BPF_ABS instruction, whose 32-bit
multi-use field "k" holds an offset into struct seccomp_data.
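
To illustrate the detection idea only (this is my own sketch, not the
patch code, and the helper name filter_reads_ip_or_args is
hypothetical; the patch emulates the filter per syscall number rather
than statically scanning it), a scan over the cBPF program could look
like:

#include <stdbool.h>
#include <stddef.h>
#include <linux/bpf_common.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Report whether any instruction can read bytes of struct seccomp_data
 * past the "nr" and "arch" fields, i.e. the instruction pointer or the
 * syscall arguments. */
static bool filter_reads_ip_or_args(const struct sock_filter *insns,
				    unsigned short len)
{
	unsigned short i;

	for (i = 0; i < len; i++) {
		/* Absolute loads are the only way into seccomp_data. */
		if (BPF_CLASS(insns[i].code) != BPF_LD ||
		    BPF_MODE(insns[i].code) != BPF_ABS)
			continue;
		/* "k" is a byte offset into struct seccomp_data;
		 * everything at or beyond instruction_pointer is
		 * the ip or an argument. */
		if (insns[i].k >=
		    offsetof(struct seccomp_data, instruction_pointer))
			return true;
	}
	return false;
}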

struct seccomp_data contains four components [1]: the syscall number,
the architecture number, the instruction pointer at the time of the
syscall, and the syscall arguments. The syscall number is enumerated
by the emulator. The arch number is treated by the cache as 'if the
arch number differs from the cached arch number, flush the cache'
(this is in seccomp_cache_check). The last two (ip and args) are
treated exactly the same way in this patch: if the filter loads either
of them at all, the syscall is marked non-cacheable for any
architecture number.
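
For reference, the layout from [1] (comments mine):

struct seccomp_data {
	int nr;                    /* syscall number */
	__u32 arch;                /* AUDIT_ARCH_* value */
	__u64 instruction_pointer; /* IP at the time of the syscall */
	__u64 args[6];             /* up to 6 args, always 64-bit */
};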

The struct seccomp_data is the only external thing the filter may
access. It is also cBPF, so it cannot contain maps to store state
between runs. Therefore a seccomp filter is a pure function. If we
know that, given some inputs (the syscall number and arch number), the
function will not evaluate any other inputs before returning, then we
can safely cache the result keyed on just those inputs.
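
As a concrete illustration (my example, not from the patch set),
consider a userspace filter that only ever loads "nr"; its verdict for
a given (nr, arch) can never change, while a single load of args[0]
would make it depend on per-call state:

#include <stddef.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Only "nr" is loaded, so the verdict is a pure function of
 * (nr, arch) and the allowed syscall is safe to cache. The usual
 * arch check is omitted here for brevity. */
struct sock_filter cacheable_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* By contrast, adding a load like
 *   BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
 *            offsetof(struct seccomp_data, args[0]))
 * reads beyond (nr, arch), so the emulator must mark the affected
 * syscalls non-cacheable. */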

As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check
compiles into:

	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
		return false;
0xffffffff8120fdb3 <+99>: movsxd rdx,DWORD PTR [r12]
0xffffffff8120fdb7 <+103>: cmp edx,0x1b7
0xffffffff8120fdbd <+109>: ja 0xffffffff8120fdf9 <__seccomp_filter+169>
	if (unlikely(thread_data->last_filter != sfilter ||
		     thread_data->last_arch != sd->arch)) {
0xffffffff8120fdbf <+111>: mov rdi,QWORD PTR [rbp-0xb8]
0xffffffff8120fdc6 <+118>: lea rsi,[rax+0x6f0]
0xffffffff8120fdcd <+125>: cmp rdi,QWORD PTR [rax+0x728]
0xffffffff8120fdd4 <+132>: jne 0xffffffff812101f0 <__seccomp_filter+1184>
0xffffffff8120fdda <+138>: mov ebx,DWORD PTR [r12+0x4]
0xffffffff8120fddf <+143>: cmp DWORD PTR [rax+0x730],ebx
0xffffffff8120fde5 <+149>: jne 0xffffffff812101f0 <__seccomp_filter+1184>
	return test_bit(syscall_nr, thread_data->syscall_ok);
0xffffffff8120fdeb <+155>: bt QWORD PTR [rax+0x6f0],rdx
0xffffffff8120fdf3 <+163>: jb 0xffffffff8120ffb7 <__seccomp_filter+615>
[... unlikely path of cache flush omitted]

and seccomp_cache_insert compiles into:

	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
		return;
0xffffffff8121021b <+1227>: movsxd rax,DWORD PTR [r12]
0xffffffff8121021f <+1231>: cmp eax,0x1b7
0xffffffff81210224 <+1236>: ja 0xffffffff8120ffb7 <__seccomp_filter+615>
	if (!test_bit(syscall_nr, sfilter->cache.syscall_ok))
		return;
0xffffffff8121022a <+1242>: mov rbx,QWORD PTR [rbp-0xb8]
0xffffffff81210231 <+1249>: mov rdx,QWORD PTR gs:0x17000
0xffffffff8121023a <+1258>: bt QWORD PTR [rbx+0x108],rax
0xffffffff81210242 <+1266>: jae 0xffffffff8120ffb7 <__seccomp_filter+615>
	set_bit(syscall_nr, thread_data->syscall_ok);
0xffffffff81210248 <+1272>: lock bts QWORD PTR [rdx+0x6f0],rax
0xffffffff81210251 <+1281>: jmp 0xffffffff8120ffb7 <__seccomp_filter+615>

When a non-cacheable syscall happens over and over, the code path goes
through the syscall_nr bound check, then the filter flush check, then
the test_bit, then another syscall_nr bound check and one more
test_bit in seccomp_cache_insert. Considering that these are stack
variables, fields of the current task_struct, or fields of the filter
struct, I imagine they would already be in the CPU data cache and not
incur much overhead. The CPU is also free to branch-predict and
reorder the memory accesses (there are no hardware memory barriers
here) to further increase efficiency, whereas a normal filter
execution would be impaired by things like retpolines.

If one were to add an additional flag for
does-userspace-want-us-to-cache, it would still be a member of the
filter struct, so whatever is loaded into the CPU data cache now would
still be loaded. Correct me if I'm wrong, but I don't think that check
would remove any significant part of the overhead of the seccomp cache
itself.

That said, I have not profiled the exact overhead this patch adds to
uncacheable syscalls; I can report back with numbers if you would like
to see them.

Does that answer your concern?

YiFei Zhu

[1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60