Re: seccomp feature development
From: Aleksa Sarai
Date: Mon May 18 2020 - 22:49:05 EST
On 2020-05-19, Jann Horn <jannh@xxxxxxxxxx> wrote:
> On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > ## deep argument inspection
> >
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
>
> Also, other than for syscall entry, it might be worth thinking about
> whether we want to have a special hook into seccomp for io_uring.
> io_uring is growing support for more and more syscalls, including
> things like openat2, connect, sendmsg, splice and so on, and that list
> is probably just going to grow in the future. If people start wanting
> to use io_uring in software with seccomp filters, it might be
> necessary to come up with some mechanism to prevent io_uring from
> permitting access to almost everything else...
>
> Probably not a big priority for now, but something to keep in mind for
> the future.
Indeed. Quite a few people have raised concerns about io_uring and its
debug-ability, but I agree that another less-commonly-mentioned concern
should be how you restrict io_uring(2) from doing operations you've
disallowed through seccomp. Though obviously user_notif shouldn't be
allowed. :D
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall.
>
> We could also do the same kind of thing the eBPF verifier does in
> convert_ctx_accesses(), and rewrite the context accesses to actually
> go through two different pointers depending on the (constant) offset
> into seccomp_data.
My main worry with this is that we'll need to figure out what kind of
offset mathematics are necessary to deal with pointers inside the
extensible struct. As a very ugly proposal, you could make it so that
you multiply the offset by PAGE_SIZE each time you want to dereference
the pointer at that offset (unless we want to add new opcodes to cBPF to
allow us to represent this).
This might even be needed for seccomp user_notif -- given one of the
recent proposals was basically to just add two (extensible) struct
pointers inside the main user_notif struct.
> > I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
>
> You probably mean the task struct?
>
> > The growing size of the EA struct will need some API design. For filters
> > to operate on the contiguous seccomp_data+EA struct, the filter will
> > need to know how large seccomp_data is (more on this later), and how
> > large the EA struct is. When the filter is written in userspace, it can
> > do the math, point into the expected offsets, and get what it needs. For
> > this to work correctly in the kernel, though, the seccomp BPF verifier
> > needs to know the size of the EA struct as well, so it can correctly
> > perform the offset checking (as it currently does for just the
> > seccomp_data struct size).
> >
> > Since there is not really any caller-based "seccomp state" associated
> > across seccomp(2) calls, I don't think we can add a new command to tell
> > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > the kernel doesn't track who "I" is besides just being "current", which
> > doesn't take into account the thread lifetime -- if a process launcher
> > knows about one size and the child knows about another, things will get
> > confused. The sizes really are just associated with individual filters,
> > based on the syscalls they're examining. So, I have thoughts on possible
> > solutions:
> >
> > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > EA style so we can pass in more than a filter and include also an
> > array of syscall to size mappings. (I don't like this...)
> > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > the meaning of the uarg from "filter" to a EA-style structure with
> > sizes and pointers to the filter and an array of syscall to size
> > mappings. (I like this slightly better, but I still don't like it.)
> > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > the "max offset" value seen during filter verification, and zero-fill
> > the EA struct with zeros to that size when constructing the
> > seccomp_data + EA struct that the filter will examine. Then the seccomp
> > filter doesn't care what any of the sizes are, and userspace doesn't
> > care what any of the sizes are. (I like this as it makes the problems
> > to solve contained entirely by the seccomp infrastructure and does not
> > touch user API, but I worry I'm missing some gotcha I haven't
> > considered.)
>
> We don't need to actually zero-fill memory for this beyond what the
> kernel supports - AFAIK the existing APIs already say that passing a
> short length is equivalent to passing zeroes, so we can just replace
> all out-of-bounds loads with zeroing registers in the filter.
> The tricky case is what should happen if the userspace program passes
> in fields that the filter doesn't know about. The filter can see the
> length field passed in by userspace, and then just reject anything
> where the length field is bigger than the structure size the filter
> knows about. But maybe it'd be slightly nicer if there was an
> operation for "tell me whether everything starting at offset X is
> zeroes", so that if someone compiles with newer kernel headers where
> the struct is bigger, and leaves the new fields zeroed, the syscalls
> still go through an outdated filter properly.
I think the best way of handling this (without breaking programs
senselessly) is to have filters essentially emulate
copy_struct_from_user() semantics -- which is along the lines of what
you've suggested.
That means in addition to treating all offsets past the end of usize as
being zeroed, we also get filters to only reject programs with usize >
fsize (the size the filter knows) if the trailing (fsize - usize) bytes
are non-zero. Having an opcode for that (a-la bpf_check_uarg_tail_zero)
would be a good idea.
> > And then, my age-old concern, that maybe doesn't need a solution... I
> > remain plagued by the lack of pathname inspection. But I think the
> > ToCToU nature of it means we just cannot do it from seccomp. It does
> > make filtering openat2()'s EA struct a bit funny... a filter has no idea
> > what path it applies to... but that doesn't matter because the object
> > the path points to might change[6] during the syscall. Argh.
>
> I don't think seccomp needs to care about paths. Instead, you can use
> one of these three approaches:
>
> 1) You can make openat2() the only syscall that is allowed to take
> non-empty path arguments, and restrict it to
> RESOLVE_BENEATH|RESOLVE_IN_ROOT. (For APIs that use AT_EMPTY_PATH, we
> can probably come up with some way to say "this part must be an empty
> string" - e.g. by defining a new bogus placeholder pointer that you
> can use as "empty path", or something like that.) This is basically
> like the old capsicum O_BENEATH stuff, except with seccomp doing the
> enforcement that you're not using absolute paths or things like that.
The only concern I'd have with this is the dfd argument (would you only
permit AT_FDCWD?). If you want to let programs run inside a
pre-configured jail directory then you'd need to white-list only *that*
fd and then block all other syscalls which would let you overwrite that
fd with something else...
> 2) You can create a new mount namespace, then use open_tree() with
> OPEN_TREE_CLONE to create file descriptors to ephemeral bind mounts,
> then sandbox yourself with pivot_root().
> 3) You can use Mickael's landlock, once that's landed.
Or (4), just use LSMs to achieve this goal (which is basically 3, but
applies to all LSMs).
> > ## changing structure sizes
> >
> > Background: there have been regular needs to add things to various
> > seccomp structures. Each come with their own unique pains, and solving
> > this as completely as possible in a future-proof way would be very nice.
> >
> > As noted in "fd passing" above, there is a desire to add some useful
> > things to the user_notif struct (e.g. thread group pid). Similarly,
> > there have been requests in the past (though I can't find any good
> > references right now, just my indirect comments[3]) to grow seccomp_data.
>
> This thing (which hasn't landed so far, but would be a really awesome
> feature) needed to add stuff to seccomp_data:
> <https://lore.kernel.org/linux-api/20181029112343.27454-1-msammler@xxxxxxxxxxx/>
>
> > Combined with the EA struct work above, there's a clear need for seccomp
> > to reexamine how it deals with its API structures (and how this
> > interacts with filters).
> >
> > First, let's consider seccomp_data. If we grow it, the EA struct offset
> > will move, based on the deep arg inspection design above. Alternatively,
> > we could instead put seccomp_data offset 0, and EA struct at offset
> > PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> > the filter access whatever it thinks is there, with it being zero-filled
> > by the kernel. For any values where 0 is valid, there will just need to
> > be a "is that field valid?" bit before it:
> >
> > unsigned long feature_bits;
> > unsigned long interesting_thing_1;
> > unsigned long interesting_thing_2;
> > unsigned long interesting_thing_3;
> > ...
> >
> > and the filter would check feature_bits...
>
> (Apart from the user_notif stuff, those feature bits would not
> actually have to exist in memory; they could be inlined while loading
> the program. Actually, not even the registers would have to exist in a
> seccomp_data struct in memory, we could just replace the loads with
> reads from the pt_regs, too.)
>
> > (However, this needs to be carefully considered given that seccomp_data
> > is embedded in user_notif... should the EA struct from userspace also be
> > copied into user_notif? More thoughts on this below...)
> >
> > For user_notif, I think we need something in and around these options:
> >
> > - make a new API that explicitly follows EA struct design
> > (and while read()/write() might be easier[4], I tend to agree with
> > Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> > for data". Though I wonder if read() could be used for the notifications,
> > which ARE data, and use ioctl() for the responses?)
>
> Just as a note: If we use read() there, we'll never be able to
> transfer things like FDs through that API.
And we run into the age-old "read() for management can be a bit hairy"
problem.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
Attachment:
signature.asc
Description: PGP signature