Re: Using ftrace/perf as a basis for generic seccomp

From: Frederic Weisbecker
Date: Thu Feb 03 2011 - 14:06:57 EST


On Wed, Feb 02, 2011 at 11:45:22AM -0500, Eric Paris wrote:
> On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> > * Masami Hiramatsu <masami.hiramatsu.pt@xxxxxxxxxxx> wrote:
> >
> > > Hi Eric,
> > >
> > > (2011/02/01 23:58), Eric Paris wrote:
> > > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <eparis@xxxxxxxxxx> wrote:
> > > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > > >> implementation (unlike the current seccomp where your choice is all
> > > >> syscalls or only read, write, sigreturn, and exit) which got little
> > > >> traction and it was suggested he instead do the same thing somehow using
> > > >> the tracing code:
> > > >> http://thread.gmane.org/gmane.linux.kernel/833556
> > >
> > > Hm, interesting idea :)
> > > But why would you like to use tracing code? just for hooking?
> >
> > What I suggested before was to reuse the scripting engine and the tracepoints.
> >
> > I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> > scripting engine could be generalized so that such 'sandboxing' code can make use of
> > it.
> >
> > For example, if you want to restrict a process to only allow open() syscalls to fd 4
> > (a very restrictive sandbox), it could be done via this filter expression:
> >
> > 'fd == 4'
> >
> > etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> > but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> > for runtime filtering of syscall parameters.
>
> Any pointers on what is involved in this abstraction? I can work out
> the details, but I don't know the big picture well enough to even start
> to move forwards.....

In the big picture, the filtering code is tightly coupled to the
tracing code. Creation, initialization and removal of filters are all
done on top of the trace event structures (struct ftrace_event_call)
because we apply and interpret filters on the fields of trace events,
which are what we save in a trace.

Example:

If you look at the sched switch trace events, we have several fields
like prev_comm and next_comm. These are defined in the TRACE_EVENT()
macro calls. So when we apply a filter like "prev_comm == firefox-bin",
we enter the filtering code with the trace_event structure for sched
switch events and iterate through its fields to find one called
prev_comm and then we work on top of that.
I think you won't work with trace events, so you need to make the
filtering code more tracing-agnostic.
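
To make that lookup concrete, here is a rough userspace sketch, not the
kernel code. struct field_desc, struct sched_switch_rec, find_field()
and filter_str_eq() are all names made up for illustration; the real
code works on struct ftrace_event_call and the fields attached to it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified description of one field of an event. */
struct field_desc {
	const char *name;
	size_t offset;		/* byte offset of the field in the record */
	size_t size;		/* size of the field in bytes */
	int is_string;		/* non-zero for fixed-size char arrays */
};

/* Hypothetical record layout, loosely modelled on sched switch. */
struct sched_switch_rec {
	char prev_comm[16];
	int prev_pid;
	char next_comm[16];
	int next_pid;
};

static const struct field_desc sched_switch_fields[] = {
	{ "prev_comm", offsetof(struct sched_switch_rec, prev_comm), 16, 1 },
	{ "prev_pid",  offsetof(struct sched_switch_rec, prev_pid), sizeof(int), 0 },
	{ "next_comm", offsetof(struct sched_switch_rec, next_comm), 16, 1 },
	{ "next_pid",  offsetof(struct sched_switch_rec, next_pid), sizeof(int), 0 },
};

/* Walk the field table and return the entry called @name, or NULL. */
static const struct field_desc *find_field(const struct field_desc *fields,
					   size_t nr, const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (strcmp(fields[i].name, name) == 0)
			return &fields[i];
	return NULL;
}

/*
 * Evaluate "<name> == <value>" against one record.
 * Returns 1 on match, 0 on mismatch, -1 if there is no string field
 * with that name.
 */
static int filter_str_eq(const struct field_desc *fields, size_t nr,
			 const void *rec, const char *name, const char *value)
{
	const struct field_desc *f = find_field(fields, nr, name);

	if (!f || !f->is_string)
		return -1;
	return strncmp((const char *)rec + f->offset, value, f->size) == 0;
}
```

The only point here is that the predicate is resolved by field name
against a per-event table. That table is the part that would have to
come from somewhere other than a trace event in your case.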

But I think it's quite workable and shouldn't be too hard to split that
into a filtering backend. Many parts are already pretty standalone.

Also I suspect the tracepoints are not what you need. Or maybe
they are. But as Masami said, the syscall tracepoint is called late.
It's workable though. The other problem is that preemption is disabled
when tracepoints are called, which is probably not what you want.
One day I think we'll need to unify the tracepoints and notifier
code, but until then it's better to keep tracepoints for tracing.

Now once you have made the filtering code more generic, you still
need an arch backend to map register contents and layout into syscall
argument names and types. On top of that you can finally use the
filtering code. For that you can, again, reuse some code we use for
tracing, namely the syscall metadata: information generated at build
time that describes the fields and types of each syscall.
That also needs to be split out, but it's much simpler than the
filtering part.
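
As a rough idea of what that mapping could look like, here is a
userspace sketch. struct fake_regs, struct sc_meta, regs_to_args() and
sc_arg_by_name() are hypothetical names, not existing kernel
interfaces. The register order is the x86-64 syscall convention (rdi,
rsi, rdx, r10, r8, r9); other archs would need their own backend.

```c
#include <string.h>

#define MAX_SYSCALL_ARGS 6

/* Hypothetical register snapshot taken at syscall entry. */
struct fake_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

/* Hypothetical per-syscall metadata: argument names and types. */
struct sc_meta {
	const char *name;
	int nb_args;
	const char *types[MAX_SYSCALL_ARGS];
	const char *args[MAX_SYSCALL_ARGS];
};

/* Example entry for write(fd, buf, count). */
static const struct sc_meta write_meta = {
	.name = "write",
	.nb_args = 3,
	.types = { "unsigned int", "const char *", "size_t" },
	.args = { "fd", "buf", "count" },
};

/* Arch backend: copy the argument registers out in argument order. */
static void regs_to_args(const struct fake_regs *regs,
			 unsigned long out[MAX_SYSCALL_ARGS])
{
	out[0] = regs->di;
	out[1] = regs->si;
	out[2] = regs->dx;
	out[3] = regs->r10;
	out[4] = regs->r8;
	out[5] = regs->r9;
}

/*
 * Look up the raw value of the argument called @name.
 * Returns 0 and fills *val on success, -1 if the syscall has no
 * argument with that name.
 */
static int sc_arg_by_name(const struct sc_meta *meta,
			  const struct fake_regs *regs,
			  const char *name, unsigned long *val)
{
	unsigned long raw[MAX_SYSCALL_ARGS];

	regs_to_args(regs, raw);
	for (int i = 0; i < meta->nb_args; i++) {
		if (strcmp(meta->args[i], name) == 0) {
			*val = raw[i];
			return 0;
		}
	}
	return -1;
}
```

With something like this, a filter on "fd" for write() resolves to
whatever sits in the first argument register, and the filtering code
never needs to know about the register layout itself.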

Note that for now, filtering + syscall metadata only works on top
of raw argument values. The syscall metadata doesn't know much
about type semantics and won't help you dereference
syscall argument pointers. You only get raw syscall parameter values.
Similarly, the filtering code can't evaluate pointer-dereferencing
expressions, only comparisons against direct values.
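
A small sketch of what "raw values only" means in practice. struct
raw_pred and raw_pred_match() are hypothetical names used only for
this illustration:

```c
/* Hypothetical comparison operators for a raw-value predicate. */
enum pred_op { PRED_EQ, PRED_NE };

/* Hypothetical predicate: compare argument number @arg against @val. */
struct raw_pred {
	int arg;
	enum pred_op op;
	unsigned long val;
};

/*
 * Evaluate the predicate against the raw syscall arguments.
 * Pointer arguments are compared as plain addresses; nothing is
 * dereferenced. Returns 1 on match, 0 otherwise.
 */
static int raw_pred_match(const struct raw_pred *p,
			  const unsigned long *args, int nb_args)
{
	if (p->arg < 0 || p->arg >= nb_args)
		return 0;

	switch (p->op) {
	case PRED_EQ:
		return args[p->arg] == p->val;
	case PRED_NE:
		return args[p->arg] != p->val;
	}
	return 0;
}
```

So a predicate on an fd or a flags value behaves as you would expect,
but a predicate on a filename argument only sees the user pointer: two
buffers holding the same path at different addresses do not compare
equal.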

But please note these are all features we want in the long term
anyway: using the kprobe expression code to interpret dereferencing,
and having more type introspection into kernel structures for
smarter syscall metadata. And we can do all that gradually
without breaking backward compatibility.

Even with only the current features you would already get
a much more powerful seccomp implementation.

And if you have questions about anything, please don't hesitate to ask.