Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality

From: Andy Lutomirski
Date: Wed Mar 09 2016 - 16:58:19 EST

[adding Kenton -- you do interesting things with seccomp, too]

On Mar 9, 2016 1:25 PM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
> On Wed, Mar 9, 2016 at 1:18 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > On Wed, Mar 9, 2016 at 1:10 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> >> On Wed, Mar 9, 2016 at 12:58 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >>> On Tue, Mar 8, 2016 at 12:40 PM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
> >>>> On 03/07/2016 03:55 PM, Andy Lutomirski wrote:
> >>>>>>>
> >>>>>>> Let task isolation users who want to detect when they screw up and do
> >>>>>>> a syscall do it with seccomp.
> >>>>>>
> >>>>>>
> >>>>>> Can you give me more details on what you're imagining here? Remember
> >>>>>> that a key use case is that these applications can remove the syscall
> >>>>>> prohibition voluntarily; it's only there to prevent unintended uses
> >>>>>> (by third party libraries or just straight-up programming bugs).
> >>>>>> As far as I can tell, seccomp does not allow you to go from "less
> >>>>>> permissive" to "more permissive" settings at all, which means that as
> >>>>>> it exists, it's not a good solution for this use case.
> >>>>>>
> >>>>>> Or were you thinking about a new seccomp API that allows this?
> >>>>>
> >>>>> I was. This is at least the second time I've wanted a way to ask
> >>>>> seccomp to allow a layer to be removed.
> >>>>
> >>>>
> >>>> Andy,
> >>>>
> >>>> Please take a look at this draft patch that intends to enable seccomp
> >>>> as something that task isolation can use.
> >>>
> >>> Kees, this sounds like it may solve your self-instrumentation problem.
> >>> Want to take a look?
> >>
> >> Errrr... I'm pretty uncomfortable with this. I really would like to
> >> keep the basic semantics of seccomp as simple as possible: filtering
> >> only gets more restricted.
> The other problem is that this won't work if the third-party code
> actually uses seccomp itself... this isn't composable as-is.

It kind of is. Set it up to trap to SIGSYS on any unexpected
seccomp() call. Then emulate it.
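
A minimal sketch of the trapping half, using only the seccomp filter API that
exists today (the "poppable" behaviour discussed in this thread is a proposal
and is not shown). The names trap_seccomp_insns and
install_trap_seccomp_filter are made up for illustration. The sketch assumes
a single native ABI: it omits the usual seccomp_data.arch check, and it does
not catch filters installed through prctl(PR_SET_SECCOMP), which a real
version would also need to trap.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Classic-BPF program: SIGSYS on seccomp(2), allow everything else. */
static struct sock_filter trap_seccomp_insns[] = {
    /* Load the syscall number from struct seccomp_data. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is seccomp(2), fall through to TRAP; otherwise skip to ALLOW. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_seccomp, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define TRAP_SECCOMP_LEN \
    (sizeof(trap_seccomp_insns) / sizeof(trap_seccomp_insns[0]))

/* Hypothetical helper: install the filter above on the calling thread. */
static int install_trap_seccomp_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)TRAP_SECCOMP_LEN,
        .filter = trap_seccomp_insns,
    };

    /* Unprivileged tasks must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

Once this is installed, any later seccomp() call from third-party code raises
SIGSYS instead of running, and the emulation would live in the SIGSYS
handler.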

To make this slightly cleaner, there could be a variant of the flag in
which a poppable seccomp filter is set to pop itself if it generates
SIGSYS. Then you'd make sure to always trap seccomp and sigaction --
you'd have to emulate those two for composability.

Presumably, for sanity, it would be illegal to have any filter at all
stacked on top of this type of filter. We'd also probably want to
prevent installation of a non-poppable filter on top of a poppable
filter -- that wouldn't make much sense.

Just to muddy the waters, there's another possible use case for this:
a sandbox program could mprotect all its critical data structures
readonly or even PROT_NONE, set up sigaltstack and a very carefully
written SIGSYS handler, install a self-popping seccomp filter that
turns *everything* into SIGSYS, and then jump to untrusted code. Now
we can finally have a trusted in-process supervisor that can't be
tampered with because it's only privileged if it's entered through a
special gate (i.e. SIGSYS).

> >>
> >> This doesn't really solve my self-instrumentation desires since I
> >> still can't sanely deliver signals. I would need a lot more
> >> convincing. :)
> >>
> >
> > I think you could do it by adding a filter that turns all the unknown
> > things into SIGSYS, allows sigreturn, and allows the seccomp syscall,
> > at least in the pop-off-the-filter variant. Then you add this
> > removably.
> >
> > In the SIGSYS handler, you pop off the filter, do your bookkeeping,
> > update the filter, and push it back on.
> No, this won't let the original syscall through. I wanted to be able
> to document the syscalls as they happened without needing audit or a
> ptrace monitor. I am currently convinced that my desire for this is no
> good, and it should just be done with a ptrace monitor...

It can let the original through -- just do the arch-specific restart
incantation before you return. On x86, reload EAX/RAX with the syscall
number and back IP up by 2 (the length of the syscall instruction). As
of Linux 4.4, the code for this is sane and has a good
test case. On older 64-bit kernels on AMD running 32-bit code, there
could be odd side effects.
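
For illustration, the x86-64 form of that incantation might look like the
sketch below. It assumes glibc on x86-64 Linux; rewind_syscall,
sigsys_handler and install_sigsys_handler are hypothetical names, not an
existing helper. It also assumes the handler has already arranged for the
retried syscall to be allowed (for example by popping the filter, which is
the proposed and not yet existing part); otherwise the restart would simply
trap again.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

#if !defined(__x86_64__)
#error "this sketch covers x86-64 only"
#endif

/* Rewrite the saved context so sigreturn re-executes the trapped syscall. */
static void rewind_syscall(ucontext_t *uc, long nr)
{
    /* Put the syscall number back in RAX. */
    uc->uc_mcontext.gregs[REG_RAX] = nr;
    /* The saved RIP points just past the 2-byte syscall instruction. */
    uc->uc_mcontext.gregs[REG_RIP] -= 2;
}

static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    /* Bookkeeping and relaxing the filter would go here (not shown). */
    rewind_syscall((ucontext_t *)ctx, info->si_syscall);
}

/* Hypothetical helper: register the handler with SA_SIGINFO. */
static int install_sigsys_handler(void)
{
    struct sigaction sa = {0};

    sa.sa_sigaction = sigsys_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

The argument registers are left untouched here, so the retried call sees the
same arguments as the original.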

If I ever factor out my SIGSYS decoder, I'll add a restart helper.