Re: [PATCH] devpts: Make each mount of devpts an independent filesystem.

From: Eric W. Biederman
Date: Wed Apr 20 2016 - 12:02:52 EST

Konstantin Khlebnikov <koct9i@xxxxxxxxx> writes:

> On Wed, Apr 20, 2016 at 5:55 PM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
>>> On Tue, Apr 19, 2016 at 9:36 PM, Konstantin Khlebnikov <koct9i@xxxxxxxxx> wrote:
>>>> On Wed, Apr 20, 2016 at 6:04 AM, Eric W. Biederman
>>>>> The kernel.pty.reserve sysctl is neutered with no way currently
>>>>> implemented to be able to use the reserved ptys.
>>>> I think we could convert this into reserve for init user namespace,
>>>> ssh in host will work even if containers eaten all ptys.
>>> Yes. That's basically how it effectively worked before (ie everything
>>> but the initial non-newinstance devpts mount would be limited to the
>>> non-reserved numbers).
>>> We required the non-init namespaces to do a newinstance mount, so the
>>> whole test for "newinstance" was effectively the same thing as just
>>> checking for the init namespace from a security standpoint.
>>> And in fact, rewriting it in that form (ie checking for init_ns) would
>>> just make it much more obvious what the intent is.
>> How does this sound?
>> When mounting a devpts filesystem, we look at the caller (aka current)
>> and if we are in the initial mount namespace set a flag in fsi that
>> allows that instance of devpts to draw into the reserve pool.
> Maybe just check current user namespace when task opens /dev/ptmx?
> IIRR now check looks like: count < limit - (newinstance ? reserved : 0).
> So, it will be count < limit - (current_in_init_userns ? 0 : newinstance).

Looking at current user namespace really is not enough. Lots of
container solutions at least historically (which means deployed right
now) don't use the user namespace.

I can see an argument to make the check: "capable(CAP_SYS_RESOURCE)".
Although for pty applications I don't know if that is particularly

I am a little dubious of making it a check at allocation time rather
than at mount time. The issue is that tty allocation is an unprivileged
operation. I expect applications such as sshd (the one that really
matters) will have dropped privileges by the time they allocate a pty.

So I feel much more comfortable with a model where things are arranged
so that applications within access of the devpts filesystem can use it
(and are not limited), and applications not in range can't. This is
roughly the authenticate-at-open-time model, and it is also what devpts
implements today.
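
To make the difference concrete, here is a rough userspace sketch, not
kernel code. Every name in it (mount_devpts, alloc_pty_mount_time,
alloc_pty_open_time, the privileged flag) and both numbers are made up
for illustration; "privileged" simply stands in for whatever test ends
up being used (initial mount namespace, a capability, and so on).

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative numbers only; these are not the kernel's defaults. */
enum { PTY_LIMIT = 8, PTY_RESERVE = 2 };

/* Hypothetical stand-in for a devpts superblock's private data. */
struct devpts_instance {
	bool may_use_reserve;	/* decided once, when the fs is mounted */
};

/* Hypothetical stand-in for the calling task. */
struct task {
	bool privileged;	/* placeholder for the real test */
};

static int pty_count;		/* ptys allocated across all instances */

/* Mount time: the mounter's context decides access to the reserve. */
static struct devpts_instance mount_devpts(const struct task *mounter)
{
	struct devpts_instance fsi = {
		.may_use_reserve = mounter->privileged,
	};
	return fsi;
}

/* Model A: the limit depends only on which instance is being used. */
static int alloc_pty_mount_time(const struct devpts_instance *fsi)
{
	int limit = PTY_LIMIT - (fsi->may_use_reserve ? 0 : PTY_RESERVE);

	if (pty_count >= limit)
		return -ENOSPC;
	return pty_count++;
}

/* Model B: the limit depends on who is opening the pty right now. */
static int alloc_pty_open_time(const struct task *opener)
{
	int limit = PTY_LIMIT - (opener->privileged ? 0 : PTY_RESERVE);

	if (pty_count >= limit)
		return -ENOSPC;
	return pty_count++;
}
```

With model B, a daemon that has already dropped privileges looks the
same as any other unprivileged opener and gets -ENOSPC once the
non-reserved ptys are gone. With model A, the same daemon still gets a
pty as long as it opens it through an instance that was mounted with
access to the reserve.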

There is also the question of how things should work if you are running
on a system where every new daemon and every new login is in its own
mount namespace, allowing each of these to have a distinct /tmp
directory. I believe systemd systems are well on their way to doing
that today. As such it does not seem appropriate to check the mount
namespace of the opener of the tty.

Who knows, it may not be long until the pty master lives in some very
tight bubble where it can barely do anything (as that is the program
that talks to the network) and user namespaces are used as part of the
enforcement of that.

For all of those reasons, a permission check in devpts_pty_new seems
like the wrong place.