Re: RFC: fsyscall

From: David Drysdale
Date: Wed Sep 09 2015 - 13:27:35 EST


On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> things like reboot(2).
> >>
> >
> > Ah, so you want to be able to grant BPF-defined capabilities :)
>
> Pretty much.
>
> Where I am focusing is turning Posix capabilities into real
> capabilities. I would not mind if the functionality was a bit more
> general. Say to be able to handle things like security labels, or
> anywhere else you might reasonably be asked "can you do X?"
>
> But I would be happy if we just managed to wrap the Posix capabilities
> and turned them into real capabilities.

Interesting idea! So kind of like the "object" in question is the root
role, and the different rights for the corresponding object-capability
(the file descriptor) are the POSIX capabilities (in the simple case
at least).

And yes, Capsicum doesn't generally interact with things like reboot(2);
its checks are on top of any DAC policies rather than instead of them,
as it's a hybrid rather than a pure object-capability system.

> > Off the top of my head, I think that doing this using a nice IPC
> > mechanism (which barely exists in Linux, but which seL4 and binder (!)
> > can do very cleanly) would be simpler and more general, if less
> > self-contained.
>
> Less self-contained becomes a problem when you want to pass them between
> processes written at different times by different people. If there
> is something conceptually simple we can implement in the kernel it
> becomes worth it because that becomes the standard which everyone knows
> to code to.
>
> > (Aside: how on earth does anyone think that replacing binder with
> > kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
> > OTOH, maybe Android doesn't use the capability-passing ability.)
>
> kdbus has file descriptor passing. Beyond that no comment.
>
> >> Which really describes what I am trying to tackle. How do we create an
> >> object that we can pass between processes that limits what we can do in
> >> the case of the oddball syscalls that require special privileges?
> >>
> >> At the same time I still want the caller to be able to pass in data to
> >> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> >> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> >> REBOOT_CMD_CAD_OFF.
> >>
> >
> > We could have a conservative whitelist of syscalls for which we allow
> > this usage. I'm a bit worried that there will be very limited use
> > cases, given that a lot of use cases will want to follow pointers,
> > which has TOCTOU problems.
>
> Time of check to time of use problems. Interesting point.
>
> TOCTOU seems to make filtering of system calls in general much less
> viable than I had hoped or imagined, and seems to be one of the better
> arguments I have heard against ioctls.

By the way, Robert Watson (one of the progenitors of Capsicum, as it
happens) has a nice paper about TOCTOU attacks on syscall interposition
layers that's a good read:
http://www.watson.org/~robert/2007woot/

(From this perspective, the limitation that seccomp-bpf programs only
have access to syscall arguments by-value is actually a help -- the filter
can't look into user memory, so can't be fooled by having memory
contents changed underneath it. Of course, if the eBPF stuff ever
changes that we should watch out...)
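
To make the by-value point concrete, here is a rough userspace sketch of
the kind of decision such a filter could make for reboot(2). It is not a
real seccomp-bpf program: struct syscall_regs and reboot_policy() are
names invented for illustration (a real filter would be a BPF program
reading struct seccomp_data), and the command values are copied from the
kernel's uapi <linux/reboot.h>, where the constants carry a LINUX_
prefix. The only thing it relies on is that the command is the third
argument of reboot(magic, magic2, cmd, arg).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command values as defined in the kernel's uapi <linux/reboot.h>. */
#define LINUX_REBOOT_CMD_CAD_OFF   0x00000000u
#define LINUX_REBOOT_CMD_HALT      0xCDEF0123u
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDCu

/*
 * Hypothetical stand-in for what a filter gets to see: the syscall
 * number and the six argument registers, all by value.
 */
struct syscall_regs {
	int      nr;
	uint64_t args[6];
};

enum verdict { VERDICT_DENY = 0, VERDICT_ALLOW = 1 };

/*
 * Hypothetical policy: allow power-off and halt, deny everything else
 * (including CAD_OFF). The decision uses only register contents, so
 * there is no user memory that could change after the check.
 */
static enum verdict reboot_policy(const struct syscall_regs *regs)
{
	uint32_t cmd;

	assert(regs != NULL);
	cmd = (uint32_t)regs->args[2];	/* reboot(magic, magic2, cmd, arg) */

	switch (cmd) {
	case LINUX_REBOOT_CMD_POWER_OFF:
	case LINUX_REBOOT_CMD_HALT:
		return VERDICT_ALLOW;
	default:
		return VERDICT_DENY;
	}
}
```

The fourth argument of reboot(2) is a pointer (used by the RESTART2
command), which is exactly the kind of thing a check like this cannot
safely inspect.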

> I think the cases I care about are much less likely to have TOCTOU
> problems than system calls in general, so I still may be ok.
>
> However it does seem like past a certain point for good filtering the
> entire syscall ABI needs to be turned into well defined IPC. Ick!

That's roughly one of Robert's suggestions (section 8.2).

> Sigh. I guess it is about time I dig up the places we call capable.
> Ugh, 1696 places in the kernel. Even filtering out CAP_SYS_ADMIN and
> CAP_NET_ADMIN the list is longer than I can easily look at.
>
> Still reboot isn't a problem ;)
>
> Thinking about the TOCTOU problems with system call filtering, the only
> general solution I can see is to handle it like the compat syscalls,
> but instead of copying things into a temporary buffer in userspace
> we copy the data into a temporary in-kernel buffer, filter the system
> call, and then:
>
>	fs = get_fs();		/* save the current address limit */
>	set_fs(get_ds());	/* let the syscall accept kernel pointers */
>	/* Call the system call */
>	set_fs(fs);		/* restore the saved address limit */
>
> I don't like the whole set_fs() thing (especially if there is any data
> we did not manage to copy). But it seems like a good conceptual start.

Doing the copies sounds like it would involve understanding & reproducing
the memory layouts for every syscall pointer argument, which would be a
lot of code. Or am I misunderstanding something?
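
For a single, made-up argument type, the copy-then-check idea might look
something like the userspace sketch below. Everything in it is
hypothetical (struct request, the CMD_* values, and the "racer" callback
that stands in for a second thread rewriting the argument); it is only
meant to show why the check has to run on the same copy that is later
used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure that a syscall argument points at. */
struct request {
	int cmd;
};

#define CMD_ALLOWED   1
#define CMD_FORBIDDEN 2

/* Stand-in for a concurrent thread modifying the caller's memory. */
typedef void (*racer_fn)(struct request *user);

static void swap_in_forbidden(struct request *user)
{
	user->cmd = CMD_FORBIDDEN;
}

static int policy_ok(int cmd)
{
	return cmd == CMD_ALLOWED;
}

/*
 * Check in place, then read the argument again at use time.
 * Returns the command that was "executed", or -1 if denied.
 */
static int call_checked_in_place(struct request *user, racer_fn racer)
{
	assert(user != NULL);
	if (!policy_ok(user->cmd))
		return -1;
	if (racer)
		racer(user);	/* argument changes after the check */
	return user->cmd;	/* second fetch sees the new value */
}

/*
 * Copy once into a private buffer, then check and use only the copy.
 * Returns the command that was "executed", or -1 if denied.
 */
static int call_checked_on_copy(struct request *user, racer_fn racer)
{
	struct request snap;

	assert(user != NULL);
	memcpy(&snap, user, sizeof(snap));	/* the only fetch */
	if (!policy_ok(snap.cmd))
		return -1;
	if (racer)
		racer(user);	/* too late: the copy is unaffected */
	return snap.cmd;
}
```

With swap_in_forbidden as the racer, the in-place version ends up
executing CMD_FORBIDDEN while the copying version executes the value it
checked. The catch is that the memcpy() here knows the layout of struct
request; a general version would need an equivalent copy routine for
each pointer argument of each syscall.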