Re: seccomp feature development

From: Aleksa Sarai
Date: Tue May 19 2020 - 09:43:20 EST


On 2020-05-19, Christian Brauner <christian.brauner@xxxxxxxxxx> wrote:
> On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > On 2020-05-18, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > > - the sizes of these EA structs are, by design, growable over time.
> > > seccomp and its users need to be handle this in a forward and backward
> > > compatible way, similar to the design of the EA syscall interface
> > > itself.
> > >
> > > The growing size of the EA struct will need some API design. For filters
> > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > need to know how large seccomp_data is (more on this later), and how
> > > large the EA struct is. When the filter is written in userspace, it can
> > > do the math, point into the expected offsets, and get what it needs. For
> > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > needs to know the size of the EA struct as well, so it can correctly
> > > perform the offset checking (as it currently does for just the
> > > seccomp_data struct size).
> > >
> > > Since there is not really any caller-based "seccomp state" associated
> > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > the kernel doesn't track who "I" is besides just being "current", which
> > > doesn't take into account the thread lifetime -- if a process launcher
> > > knows about one size and the child knows about another, things will get
> > > confused. The sizes really are just associated with individual filters,
> > > based on the syscalls they're examining. So, I have thoughts on possible
> > > solutions:
> > >
> > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > > EA style so we can pass in more than a filter and include also an
> > > array of syscall to size mappings. (I don't like this...)
> > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > > the meaning of the uarg from "filter" to a EA-style structure with
> > > sizes and pointers to the filter and an array of syscall to size
> > > mappings. (I like this slightly better, but I still don't like it.)
> > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > > the "max offset" value seen during filter verification, and zero-fill
> > > the EA struct with zeros to that size when constructing the
> > > seccomp_data + EA struct that the filter will examine. Then the seccomp
> > > filter doesn't care what any of the sizes are, and userspace doesn't
> > > care what any of the sizes are. (I like this as it makes the problems
> > > to solve contained entirely by the seccomp infrastructure and does not
> > > touch user API, but I worry I'm missing some gotcha I haven't
> > > considered.)
> >
> > Okay, so here is my view on this. I think that the third option is
> > closest to what I'd like to see. Based on Jann's email, I think we're on
> > the same page but I'd just like to elaborate it a bit further:
> >
> > First of all -- ideally, the backward and forward compatibility that EA
> > syscalls give us should be reflected with seccomp filters being
> > similarly compatible. Otherwise we're going to run into issues where all
> > of the hard work with ensuring EA syscalls behave when extended will be
> > less valuable if seccomp cannot handle it sufficiently. This means that
> > I would hope that every combination of {old,new} filter/kernel/program
> > would work on a best-effort (but fail-safe) basis.
> >
> > In my view, the simplest way (from the kernel side) would be to simply
> > do what you outlined in (3) -- make all accesses past usize (and even
> > ksize) be zeroed.
> >
> > However in order to make an old filter fail-safe on a new kernel with a
> > new program, we'd need a new opcode which basically does
> > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > would punt the fail-safe problem to userspace (libseccomp would need to
> > generate a check that any unknown-to-the-filter fields are zero). I
> > don't think this is a decision we can make in-kernel because it might be
> > that the filter doesn't care about the last field in a struct (and thus
> > doesn't access it) but we don't know the difference between a field the
> > filter doesn't care about and a field it doesn't know about.
> >
> > Another issue which you haven't mentioned here is how do we deal with
> > pointers inside an EA struct. clone3() already has this with set_tid,
>
> It's also passed with a set_tid_size argument so we'd know how large
> set_tid is.

Right, of course -- that's a given. See below for the point I was
making.

> > and the thread on user_notif seems to be converging towards having the
> > user_notif2 struct just contain two pointers to EA structs. We could
> > punt on this for now, so long as we leave room for us to deal with this
> > problem in the future. However, I think it would be misguided to enable
> > deep argument inspection on syscalls which have such entries (such as
> > clone3) until we've figured out how we want to handle nested pointers --
>
> We can't really punt on this and should figure this out now. I've
> pointed to this problem in my other mail as well.
> Though I currently fail to see why this should be a huge problem to get
> that working as long as each pointer comes with a known size.

It's not a huge problem necessarily, but we would need to figure out how
we would represent the "nested pointers" in the flattened version which
the seccomp filter will actually offset into. Specifically, my point is
that if we imagine that the proposed new seccomp_data API is:

[<seccomp_data>][<EA struct contents>]

then any pointers in the EA struct will just be numerical values when
copied. How would write a filter on clone3() that requires the first
entry of set_tid to be 34? You couldn't with this simplified setup. And
unless I'm mistaken, BPF doesn't have pointer dereferences.

So we need to embed the "nested pointers" somehow in the flattened
version of the EA struct. I don't think we can just append them to the
main EA struct -- while you might have the length, that'd mean that
seccomp would generate an array of zeroes whose length is a
user-specified value? And then what about extensible structs that have
their size embedded inside them (as you or someone else suggested for
seccomp user_notif)? How would the eBPF program know the length of the
struct?

Maybe we could have some kind of jump table which the filter could use
(and just have there be a jump table for each embedded struct -- so
multiple embeddings have their own jump tables). So if we imagine some
user-extensible struct like

struct open_how {
u64 flags, mode, resolve;
struct foo *foo;
struct bar *bar;
struct baz *baz;
};

struct foo {
u64 foo;
};

struct bar {
u64 bar;
struct barbaz *barbaz;
};

struct barbaz {
u64 barbaz;
};

struct baz {
u64 baz;
};

Then we would lay out the seccomp_data as:

[<seccomp data>][<jump table><open_how>]
[<jump table><open_how->foo>]
[<jump table><open_how->bar>]
[<jump table><open_how->bar->barbaz>]
[<jump table><open_how->baz>]

The jump table could be as simple as a list of tuples (relative offset
of field in this struct from end of jump table, relative offset of the
copied structure at the end of the jump table). However, it's not
possible to do for-loops in cBPF -- so maybe we'd need to represent the
jump table differently (perhaps having it just be the same size as the
struct).

The above is just a rough sketch, maybe there's a much simpler solution
I'm not seeing.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Attachment: signature.asc
Description: PGP signature