Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF

From: Jamie Lokier
Date: Thu Jan 12 2012 - 12:58:20 EST


Will Drewry wrote:
> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote:
> > Will Drewry wrote:
> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
> >> >
> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> >> >> user namespace.  Once a task-local filter program is attached from a
> >> >> process without privileges, execve will fail.  This ensures that only
> >> >> privileged parent task can affect its privileged children (e.g., setuid
> >> >> binary).
> >> >
> >> > This means that a non privileged user can not run another program with
> >> > limited features? How would a process exec another program and filter
> >> > it? I would assume that the filter would need to be attached first and
> >> > then the execv() would be performed. But after the filter is attached,
> >> > the execv is prevented?
> >>
> >> Yeah - it means tasks can filter themselves, but not each other.
> >> However, you can inject a filter for any dynamically linked executable
> >> using LD_PRELOAD.
> >>
> >> > Maybe I don't understand this correctly.
> >>
> >> You're right on.  This was to ensure that one process didn't cause
> >> crazy behavior in another. I think Alan has a better proposal than
> >> mine below.  (Goes back to catching up.)
> >
> > You can already use ptrace() to cause crazy behaviour in another
> > process, including modifying registers arbitrarily at syscall entry
> > and exit, aborting and emulating syscalls.
> >
> > ptrace() is quite slow and it would be really nice to speed it up,
> > especially for trapping a small subset of syscalls, or limiting some
> > kinds of access to some file descriptors, while everything else runs
> > at normal speed.
> >
> > Speeding up ptrace() with BPF filters would be really nice.  Not
> > that I like ptrace(), but sometimes it's the only thing you can rely on.
> >
> > LD_PRELOAD and code running in the target process address space can't
> > always be trusted in some contexts (e.g. the target process may modify
> > the tracing code or its data); whereas ptrace() is pretty complete and
> > reliable, if ugly.
> >
> > There's already a security model around who can use ptrace(); speeding
> > it up needn't break that.
> >
> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> > needed as userspace could have done it, with exactly the restrictions
> > it wants.  Google's NaCl comes to mind as a potential user.
>
> That's not entirely true. ptrace supervisors are subject to races and
> always fail open. This makes them effective but not as robust as a
> seccomp solution can provide.

What races do you know about?

I'm not aware of any ptrace races if it's used properly. I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.

Having said that, I can think of one race, but I think your BPF scheme
has the same one: after the syscall's string arguments and other
pointed-to data have been checked, another thread can change them
before the real syscall uses them.
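
A minimal sketch of that check-then-use race, written as a deterministic
user-space simulation rather than real kernel or ptrace code. Everything
here is hypothetical and for illustration only: the "/tmp/" policy, the
names filter_allows() and race_demo(), and the second thread's write,
which is serialized into the same function so the outcome is repeatable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy: only paths under /tmp/ are permitted. */
int filter_allows(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/*
 * Simulates the race in a fixed order:
 *   1. the checker inspects the pointed-to argument,
 *   2. "another thread" overwrites the same memory,
 *   3. the syscall reads the argument again when it actually runs.
 * Returns 1 if the check passed but the call then saw a path the
 * policy would have rejected.
 */
int race_demo(char *shared, size_t size)
{
    assert(size >= 16);                           /* room for both paths */

    snprintf(shared, size, "%s", "/tmp/harmless");
    int allowed_at_check = filter_allows(shared); /* time of check */

    snprintf(shared, size, "%s", "/etc/shadow");  /* the other thread's write */

    int allowed_at_use = filter_allows(shared);   /* time of use */
    return allowed_at_check && !allowed_at_use;
}
```

Called with a 64-byte buffer, race_demo() reports that the check passed
while the path seen at the time of use was a different one. A check that
looks only at register values, and never at pointed-to memory, does not
have this window.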

> With seccomp, it fails close. What I think would make sense would be
> to add a user-controllable failure mode with seccomp bpf that calls
> tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
> works quite well, but I didn't want to conflate the discussions.

I think it's a nice idea. While you're at it, could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)

(I think it's ok to call the tracehook function on all archs though.)

> Using ptrace() would also mean that all consumers of this interface
> would need a supervisor, but with seccomp, the filters are installed
> and require no supervisors to stick around for when failure occurs.
>
> Does that make sense?

It does. I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.
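
For concreteness, here is a sketch of the supervisor-less case, where
"failure" just means the syscall returns an error. It assumes the prctl()
and <linux/seccomp.h> interface found in current mainline kernels, which
is not necessarily the interface proposed in this RFC. The helper names
build_deny_filter() and install_filter() are made up for the example, and
the architecture check a real filter needs is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Fill prog[] (at least 4 slots) with a filter that makes syscall `nr`
 * fail with EPERM and allows everything else.  Returns the number of
 * instructions written.  NOTE: no seccomp_data.arch check is done here.
 */
size_t build_deny_filter(struct sock_filter *prog, int nr)
{
    size_t n = 0;

    assert(prog != NULL);
    /* A = seccomp_data.nr (the syscall number) */
    prog[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                             offsetof(struct seccomp_data, nr));
    /* if (A == nr) fall through to the ERRNO return, else skip it */
    prog[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                             (unsigned int)nr, 0, 1);
    prog[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                    SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
    prog[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                             SECCOMP_RET_ALLOW);
    return n;
}

/*
 * Attach the filter to the calling task.  No tracer is involved; the
 * filter stays in force for the life of the task.  Returns 0 on success
 * and -1 with errno set on failure.
 */
int install_filter(struct sock_filter *prog, size_t n)
{
    struct sock_fprog fprog = { .len = (unsigned short)n, .filter = prog };

    /* Required for unprivileged callers in mainline kernels. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

A task would call build_deny_filter() on a four-element array and pass
the result to install_filter(); after that the chosen syscall fails with
EPERM and nothing else needs to stay around to enforce it.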

On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.

For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access. I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well. And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.

-- Jamie