Re: [PATCH] make TIOCSTI ioctl require CAP_SYS_ADMIN
From: Serge E. Hallyn
Date: Sat Apr 22 2017 - 15:51:17 EST
Quoting Matt Brown (matt@xxxxxxxxx):
> On 04/21/2017 01:24 AM, Serge E. Hallyn wrote:
> >On Fri, Apr 21, 2017 at 01:09:59AM -0400, Matt Brown wrote:
> >>On 04/20/2017 01:41 PM, Serge E. Hallyn wrote:
> >>>Quoting matt@xxxxxxxxx (matt@xxxxxxxxx):
> >>>>On 2017-04-20 11:19, Serge E. Hallyn wrote:
> >>>>>Quoting Matt Brown (matt@xxxxxxxxx):
> >>>>>>On 04/19/2017 07:53 PM, Serge E. Hallyn wrote:
> >>>>>>>Quoting Matt Brown (matt@xxxxxxxxx):
> >>>>>>>>On 04/19/2017 12:58 AM, Serge E. Hallyn wrote:
> >>>>>>>>>On Tue, Apr 18, 2017 at 11:45:26PM -0400, Matt Brown wrote:
> >>>>>>>>>>This patch reproduces GRKERNSEC_HARDEN_TTY functionality from the grsecurity
> >>>>>>>>>>project in-kernel.
> >>>>>>>>>>
> >>>>>>>>>>This will create the Kconfig SECURITY_TIOCSTI_RESTRICT and the corresponding
> >>>>>>>>>>sysctl kernel.tiocsti_restrict that, when activated, restrict all TIOCSTI
> >>>>>>>>>>ioctl calls from non CAP_SYS_ADMIN users.
> >>>>>>>>>>
> >>>>>>>>>>Possible effects on userland:
> >>>>>>>>>>
> >>>>>>>>>>There could be a few user programs that would be effected by this
> >>>>>>>>>>change.
> >>>>>>>>>>See: <https://codesearch.debian.net/search?q=ioctl%5C%28.*TIOCSTI>
> >>>>>>>>>>notable programs are: agetty, csh, xemacs and tcsh
> >>>>>>>>>>
> >>>>>>>>>>However, I still believe that this change is worth it given that the
> >>>>>>>>>>Kconfig defaults to n. This will be a feature that is turned on for the
> >>>>>>>>>
> >>>>>>>>>It's not worthless, but note that for instance before this was fixed
> >>>>>>>>>in lxc, this patch would not have helped with escapes from privileged
> >>>>>>>>>containers.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>I assume you are talking about this CVE:
> >>>>>>>>https://bugzilla.redhat.com/show_bug.cgi?id=1411256
> >>>>>>>>
> >>>>>>>>In retrospect, is there any way that an escape from a privileged
> >>>>>>>>container with the this bug could have been prevented?
> >>>>>>>
> >>>>>>>I don't know, that's what I was probing for. Detecting that the pgrp
> >>>>>>>or session - heck, the pid namespace - has changed would seem like a
> >>>>>>>good indicator that it shouldn't be able to push.
> >>>>>>>
> >>>>>>
> >>>>>>pgrp and session won't do because in the case we are discussing
> >>>>>>current->signal->tty is the same as tty.
> >>>>>>
> >>>>>>This is the current check that is already in place:
> >>>>>>| if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
> >>>>>>| return -EPERM;
> >>>>>
> >>>>>Yeah...
> >>>>>
> >>>>>>The only thing I could find to detect the tty message coming from a
> >>>>>>container is as follows:
> >>>>>>| task_active_pid_ns(current)->level
> >>>>>>
> >>>>>>This will be zero when run on the host, but 1 when run inside a
> >>>>>>container. However this is very much a hack and could probably break
> >>>>>>some userland stuff where there are multiple levels of namespaces.
> >>>>>
> >>>>>Yes. This is also however why I don't like the current patch, because
> >>>>>capable() will never be true in a container, so nested containers
> >>>>>break.
> >>>>>
> >>>>
> >>>>What do you mean by "capable() will never be true in a container"?
> >>>>My understanding
> >>>>is that if a container is given CAP_SYS_ADMIN then
> >>>>capable(CAP_SYS_ADMIN) will return
> >>>>true?
> >>>
> >>>No, capable(X) checks for X with respect to the initial user namespace.
> >>>So for root-owned containers it will be true, but containers running in
> >>>non-initial user namespaces cannot pass that check.
> >>>
> >>>To check for privilege with respect to another user namespace, you need
> >>>to use ns_capable. But for that you need a user_ns to target.
> >>>
> >>
> >>How about: ns_capable(current_user_ns(),CAP_SYS_ADMIN) ?
> >>
> >>current_user_ns() was found in include/linux/cred.h
> >
> >Any user can create a new user namespace and pass the above check. What we
> >want is to find the user namespace which opened the tty.
> >
>
> I believe I have a working solution that I can show in the next version
> of the patch later today, but I just want to run the logic by you first.
>
> I added: "struct user_namespace *owner_user_ns;" as a field in
> tty_struct (include/linux/tty.h) Note: I am totally open to suggestions
> for a better name.
>
> Then I added "tty->owner_user_ns = current_user_ns();" to the
> alloc_tty_struct function. (drivers/tty/tty_io.c)
That's what I was hoping could work. Then you can check ns_capable
with respect to that.
You'll want to grab a reference to the user_ns, and drop it on
final close, but otherwise this sounds good to me. I don't really
know the tty layer well though so we'll need some sanity checking
from someone who does.
> When testing with a docker container, running in a different user
> namespace, I printed out current_user_ns()->level, which returned 1,
> and tty->owner_user_ns->level, which returned 0. This seems to prove
> that I am correctly storing the user namespace which opened the tty.
>
> Please let me know if there are any edge cases that I am missing with
> this approach.
Thanks for posting this! This seems like the best solution to me.