[PATCH 0/7] Revisiting expanded seccomp functionality

From: Will Drewry
Date: Wed Apr 27 2011 - 23:14:42 EST


I'd like to revisit the past discussions around extending seccomp
functionality:
1. http://lwn.net/Articles/332438/
2. http://thread.gmane.org/gmane.linux.kernel/1086816/focus=1096626

First, some background and motivation. Feel free to skip straight to the
patches!

kernel/seccomp.c provides early system call interception hooks which
have been used for reducing the kernel attack surface for a given
user-level task. Normally, seccomp limits the kernel interfaces to read,
write, sigreturn, and exit. These restrictions have proved effective,
but for many common uses, the model is too draconian. That reality
doesn't mean that a less aggressive reduction of the attack surface
wouldn't still have beneficial effects.
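
As a minimal sketch of the existing behavior, assuming a Linux system
whose headers spell mode 1 as SECCOMP_MODE_STRICT (current kernels do),
the helper below forks a child, puts it into seccomp via prctl(), lets
it write() one byte to a pipe, and then has it call getpid(), which is
outside the allowed set. The helper and enum names are made up for this
illustration.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative outcome codes; not part of any kernel interface. */
enum strict_result {
	STRICT_KILLED,		/* child wrote its byte, then got SIGKILL */
	STRICT_UNAVAILABLE,	/* prctl() refused to enter seccomp mode 1 */
	STRICT_UNEXPECTED	/* anything else */
};

enum strict_result run_strict_child(void)
{
	int fds[2];
	int status;
	char byte = 0;
	ssize_t got;
	pid_t pid;

	if (pipe(fds) != 0)
		return STRICT_UNEXPECTED;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return STRICT_UNEXPECTED;
	}

	if (pid == 0) {
		close(fds[0]);
		/* Not yet restricted, so a normal _exit() works on failure. */
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
			_exit(2);
		/* write(2) is in the allowed set, so this succeeds. */
		syscall(SYS_write, fds[1], "k", 1);
		/* getpid(2) is not allowed: the kernel kills the task here. */
		syscall(SYS_getpid);
		/* Not reached. Plain exit(2) is allowed; exit_group(2) is not. */
		syscall(SYS_exit, 0);
	}

	/* Parent: collect the byte (if any) and the child's fate. */
	close(fds[1]);
	got = read(fds[0], &byte, 1);
	close(fds[0]);
	if (waitpid(pid, &status, 0) != pid)
		return STRICT_UNEXPECTED;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
		return STRICT_UNAVAILABLE;
	if (got == 1 && byte == 'k' &&
	    WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
		return STRICT_KILLED;
	return STRICT_UNEXPECTED;
}
```

Calling run_strict_child() from a test program should report
STRICT_KILLED where seccomp is available: the write() goes through, and
the first call outside the four permitted ones is fatal. That
all-or-nothing behavior is the inflexibility discussed here.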

To accommodate the lack of flexibility, there are several out-of-tree
patches for system call interception (with and without further-reaching
"policy" enforcements) and even a complex pure-assembly trusted
supervisor-thread to broker the requests of seccomp-guarded threads
(http://code.google.com/p/seccompsandbox). The latter requires severe
contortions with a high chance of accidental attack surface exposure,
while out-of-tree patches are just that. (This ignores the handful of
userspace solutions, like plash and systrace, which jump through their
own hurdles and suffer not only from complexity but from a heavy
performance penalty. Of course, those approaches often include policy
enforcement work in addition to pure attack surface reduction, but
that's tangential.)

In general, attack surface reduction is applicable in most
circumstances, but it is especially valuable when handling untrusted
data (which seccomp was originally meant to help with!).

Some simple motivating examples are as follows:
- disallowing perf system calls inside an SELinux sandbox (before parsing
occurs, such that true policy logic can be applied when appropriate).
- minimizing kernel attack surface during untrusted JIT execution
(ActionScript, JavaScript, etc.).
- ...

This patchset provides a flexible means to perform kernel attack surface
reduction using the early seccomp system call hooks and the ftrace
filter engine, which provides system call name-to-number translation
along with limited argument-based filtering.
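
To make the idea concrete, here is a small userspace model of the
decision being described: resolve a system call name to its number,
record whether it is allowed, and optionally require a specific first
argument. This is only a sketch of the concept. It is not the interface
or the code these patches add, and every name in it (model_rule,
model_lookup, model_allow, model_check, the "sys_" name table) is
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

#define MODEL_MAX_NR 1024	/* arbitrary bound for this model */

/* One rule per system call number (hypothetical layout). */
struct model_rule {
	bool allowed;		/* is the call permitted at all? */
	bool check_arg0;	/* if so, must the first argument match? */
	long arg0;		/* required first argument when check_arg0 */
};

static struct model_rule model_rules[MODEL_MAX_NR];

/* Tiny stand-in for name-to-number translation. */
static const struct {
	const char *name;
	long nr;
} model_names[] = {
	{ "sys_read",  SYS_read },
	{ "sys_write", SYS_write },
	{ "sys_close", SYS_close },
	{ "sys_exit",  SYS_exit },
};

/* Return the number for a known name, or -1 if the name is unknown. */
long model_lookup(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(model_names) / sizeof(model_names[0]); i++)
		if (strcmp(model_names[i].name, name) == 0)
			return model_names[i].nr;
	return -1;
}

/* Allow a call by name; returns false if the name cannot be resolved. */
bool model_allow(const char *name, bool check_arg0, long arg0)
{
	long nr = model_lookup(name);

	if (nr < 0 || nr >= MODEL_MAX_NR)
		return false;
	model_rules[nr].allowed = true;
	model_rules[nr].check_arg0 = check_arg0;
	model_rules[nr].arg0 = arg0;
	return true;
}

/* Decide whether a call with this number and first argument may run. */
bool model_check(long nr, long arg0)
{
	if (nr < 0 || nr >= MODEL_MAX_NR || !model_rules[nr].allowed)
		return false;	/* default deny */
	if (model_rules[nr].check_arg0 && model_rules[nr].arg0 != arg0)
		return false;	/* argument predicate failed */
	return true;
}
```

With model_allow("sys_read", true, 0) in place, model_check(SYS_read, 0)
passes while model_check(SYS_read, 3) and any call without a rule are
rejected. The default-deny shape is what makes this useful for attack
surface reduction rather than policy enforcement.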

Patches 1 through 5 cover the meat of this change. Patch 3 contains the
more controversial pieces, I suspect. Patches 6 and 7 show some of the
work that is needed to make this system even more effective. (Even
without those patches, it is still quite useful.)

Core changes as part of this proposal:
[PATCH 1/7] tracing: split out filter init, access, tear down.
[PATCH 2/7] tracing: split out syscall_trace_enter construction
[PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering
[PATCH 4/7] seccomp_filter: add process state reporting
[PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

Nice-to-haves, imo, for ftrace and this proposal:
[PATCH 6/7] include/linux/syscalls.h: add __ layer of macros with return types.
[PATCH 7/7] arch/x86: hook int returning system calls

Any and all commentary will be appreciated!

I feel that the approach of this patch series addresses both the
continued need for attack surface reduction when handling untrusted
content and the need to reuse the developing ftrace
infrastructure. I'm certain there are bugs, style issues, etc., but I
hope that the general design leaves everyone else feeling that this
approach addresses those needs too. I will happily address any
issues if it means we might make progress on this iteration of the
exposed-kernel-surface discussion!

Thanks!
will