Re: seccomp feature development

From: Jann Horn
Date: Mon May 18 2020 - 18:40:11 EST


On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> ## deep argument inspection
>
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).

Also, other than for syscall entry, it might be worth thinking about
whether we want to have a special hook into seccomp for io_uring.
io_uring is growing support for more and more syscalls, including
things like openat2, connect, sendmsg, splice and so on, and that list
is probably just going to grow in the future. If people start wanting
to use io_uring in software with seccomp filters, it might be
necessary to come up with some mechanism to prevent io_uring from
permitting access to almost everything else...

Probably not a big priority for now, but something to keep in mind for
the future.

[...]
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall.

We could also do the same kind of thing the eBPF verifier does in
convert_ctx_accesses(), and rewrite the context accesses to actually
go through two different pointers depending on the (constant) offset
into seccomp_data.

> I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.

You probably mean the task struct?

> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
>
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
>
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> EA style so we can pass in more than a filter and include also an
> array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> the meaning of the uarg from "filter" to a EA-style structure with
> sizes and pointers to the filter and an array of syscall to size
> mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
> the "max offset" value seen during filter verification, and zero-fill
> the EA struct with zeros to that size when constructing the
> seccomp_data + EA struct that the filter will examine. Then the seccomp
> filter doesn't care what any of the sizes are, and userspace doesn't
> care what any of the sizes are. (I like this as it makes the problems
> to solve contained entirely by the seccomp infrastructure and does not
> touch user API, but I worry I'm missing some gotcha I haven't
> considered.)

We don't need to actually zero-fill memory for this beyond what the
kernel supports - AFAIK the existing APIs already say that passing a
short length is equivalent to passing zeroes, so we can just replace
all out-of-bounds loads with zeroing registers in the filter.
The tricky case is what should happen if the userspace program passes
in fields that the filter doesn't know about. The filter can see the
length field passed in by userspace, and then just reject anything
where the length field is bigger than the structure size the filter
knows about. But maybe it'd be slightly nicer if there was an
operation for "tell me whether everything starting at offset X is
zeroes", so that if someone compiles with newer kernel headers where
the struct is bigger, and leaves the new fields zeroed, the syscalls
still go through an outdated filter properly.

> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.

I don't think seccomp needs to care about paths. Instead, you can use
one of these three approaches:

1) You can make openat2() the only syscall that is allowed to take
non-empty path arguments, and restrict it to
RESOLVE_BENEATH|RESOLVE_IN_ROOT. (For APIs that use AT_EMPTY_PATH, we
can probably come up with some way to say "this part must be an empty
string" - e.g. by defining a new bogus placeholder pointer that you
can use as "empty path", or something like that.) This is basically
like the old capsicum O_BENEATH stuff, except with seccomp doing the
enforcement that you're not using absolute paths or things like that.
2) You can create a new mount namespace, then use open_tree() with
OPEN_TREE_CLONE to create file descriptors to ephemeral bind mounts,
then sandbox yourself with pivot_root().
3) You can use Mickael's landlock, once that's landed.

In particular the first and second option are things you can already
do today - although the first one requires that you have a seccomp
filter that prevents the program from overwriting/unmapping/... the
chunk of memory that contains the openat2 argument structure, so that
you can then whitelist the argument structure's address in the filter.

> ## changing structure sizes
>
> Background: there have been regular needs to add things to various
> seccomp structures. Each come with their own unique pains, and solving
> this as completely as possible in a future-proof way would be very nice.
>
> As noted in "fd passing" above, there is a desire to add some useful
> things to the user_notif struct (e.g. thread group pid). Similarly,
> there have been requests in the past (though I can't find any good
> references right now, just my indirect comments[3]) to grow seccomp_data.

This thing (which hasn't landed so far, but would be a really awesome
feature) needed to add stuff to seccomp_data:
<https://lore.kernel.org/linux-api/20181029112343.27454-1-msammler@xxxxxxxxxxx/>

> Combined with the EA struct work above, there's a clear need for seccomp
> to reexamine how it deals with its API structures (and how this
> interacts with filters).
>
> First, let's consider seccomp_data. If we grow it, the EA struct offset
> will move, based on the deep arg inspection design above. Alternatively,
> we could instead put seccomp_data offset 0, and EA struct at offset
> PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> the filter access whatever it thinks is there, with it being zero-filled
> by the kernel. For any values where 0 is valid, there will just need to
> be a "is that field valid?" bit before it:
>
> unsigned long feature_bits;
> unsigned long interesting_thing_1;
> unsigned long interesting_thing_2;
> unsigned long interesting_thing_3;
> ...
>
> and the filter would check feature_bits...

(Apart from the user_notif stuff, those feature bits would not
actually have to exist in memory; they could be inlined while loading
the program. Actually, not even the registers would have to exist in a
seccomp_data struct in memory, we could just replace the loads with
reads from the pt_regs, too.)

> (However, this needs to be carefully considered given that seccomp_data
> is embedded in user_notif... should the EA struct from userspace also be
> copied into user_notif? More thoughts on this below...)
>
> For user_notif, I think we need something in and around these options:
>
> - make a new API that explicitly follows EA struct design
> (and while read()/write() might be easier[4], I tend to agree with
> Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> for data". Though I wonder if read() could be used for the notifications,
> which ARE data, and use ioctl() for the responses?)

Just as a note: If we use read() there, we'll never be able to
transfer things like FDs through that API.

> ## syscall bitmasks

YES PLEASE

> Background: the number one bit of feedback on seccomp has been performance
> concerns, especially for fast syscalls like futex(). When looking at
> where time is spent, it is very clearly spent running the filters (which
> I found surprising, given that adding TIF_SECCOMP tended to trip the
> "slow path" syscall path (though most architectures these days just
> don't have a fast path any more thanks to Meltdown). It would be nice
> to make filtering faster without running BPF at all. :)
>
> Nearly every thread on adding eBPF, for example, has been about trying to
> speed up the if/then nature of BPF for finding a syscall that the filter
> wants to always accept (or reject). The bulk of most seccomp filters
> are pretty simple, usually they're either "reject everything except
> $set-of-syscalls", or "accept everything except $set-of-syscalls". The
> stuff in between tends to be a mix, like "accept $some, process $these
> with argument checks, and reject $remaining".
>
> In all three cases, the entire seccomp() path could be sped up by having
> a syscall bitmask that could be applied before the filters are ever run,
> with 3 (actually 2) syscall bitmasks: accept, reject, or process. If
> a syscall was in the "accept" bitmask, we immediately exit seccomp and
> continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected
> and immediately exit seccomp. And finally, we run the filters. In all
> ways, doing bitmask math is going to be faster than running the BPF. ;)
>
> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
>
> - new API for "add this syscall to the reject bitmask". We can't really
> do an "accept" bitmask addition without processing the attached
> filters...
> - process attached filters! Each time a filter is added, have the
> BPF verifier do an analysis to determine if there are any static
> results, and set bits in the various bitmasks to represent it.
> i.e. when seccomp is first enabled on a thread, the "accept"
> bitmask is populated with all syscalls, and for each filter, do
> [math,simulation,magic] and knock each syscall out of "accept" if
> it isn't always accepted, and further see if there are any syscalls
> that are always rejected, and mark those in the "reject" bitmask.

Other options:
- add a "load from read-only memory" opcode and permit specifying the
data that should be in that memory when loading the filter
- make the seccomp API take an array of (syscall-number,
instruction-offset) tuples, and start evaluation of the filter at an
offset if it's one of those syscalls

One more thing that would be really nice: Is there a way we can have
64-bit registers in our seccomp filters? At the moment, every
comparison has to turn into three ALU ops, which is pretty messy;
libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and
it contributes to the horrific code Chrome's BPF generator creates.
Here's some pseudocode from my hacky BPF disassembler, which shows
pretty much just the filter code for filtering getpriority() and
setpriority() in a Chrome renderer, with tons of useless dead code:

0139 if args[0].high == 0x00000000: [true +3, false +0]
013e if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
0140 if args[1].high == 0x00000000: [true +3, false +0]
0145 if args[1].low == 0x00000000: [true +149, false +0]
-> ret ALLOW (syscalls: getpriority, setpriority)
0147 if args[1].high == 0x00000000: [true +3, false +0]
014c if args[1].low == 0x00000001: [true +142, false
+141] -> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0148 if args[1].high != 0xffffffff: [true +142, false +0]
-> ret TRAP
014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0141 if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
0143 if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
false +0] -> ret TRAP
0145 if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147 if args[1].high == 0x00000000: [true +3, false +0]
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
013a if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP
013c if args[0].low NO-COMMON-BITS 0x80000000: [true +154,
false +0] -> ret TRAP
013e if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
0140 if args[1].high == 0x00000000: [true +3, false +0]
0145 if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147 if args[1].high == 0x00000000: [true +3, false +0]
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0141 if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
0143 if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
false +0] -> ret TRAP
0145 if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147 if args[1].high == 0x00000000: [true +3, false +0]
014c if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO
0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c if args[1].low == 0x00000001: [true +142, false +141] ->
ret ALLOW (syscalls: getpriority, setpriority)
01da ret ERRNO

which is generated by this little snippet of C++ code:

ResultExpr RestrictGetSetpriority(pid_t target_pid) {
const Arg<int> which(0);
const Arg<int> who(1);
return If(which == PRIO_PROCESS,
Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM)))
.Else(CrashSIGSYS());
}

On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc,
we're actually already using the eBPF backend infrastructure... so
maybe it would be an option to keep enforcing basically the same rules
that we currently have for cBPF, but use the eBPF instruction format?