The problem of setgroups and containers

From: Eric W. Biederman
Date: Sun Oct 18 2020 - 09:05:08 EST


[ Added linux-api because we are talking about a subtle semantic
change to the permission checks ]

Christian Brauner <christian.brauner@xxxxxxxxxx> writes:

> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>> "Enrico Weigelt, metux IT consult" <lkml@xxxxxxxxx> writes:
>>
>> > On 30.08.20 16:39, Christian Brauner wrote:
>> >> For mount points
>> >> that originate from outside the namespace, everything will show as
>> >> the overflow ids and access would be restricted to the most
>> >> restricted permission bit for any path that can be accessed.
>> >
>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>>
>> Interesting until reading through your commentary I had missed the
>> proposal to effectively effectively change the permissions to:
>> ((mode >> 3) & (mode >> 6) & mode & 7).
>>
>> The challenge is that in a permission triple it is possible to set
>> lower permissions for the owner of the file, or for a specific group,
>> than for everyone else.
>>
>> Today we require root permissions to be able to map users and groups in
>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>> permissions to be able to drop groups with setgroups.
>>
>> Now we are discussiong moving to a world where we can use users and
>> groups that don't map to any other user namespace in uid_map and
>> gid_map. It should be completely safe to use those users and groups
>> except for negative permissions in filesystems. So a big question is
>> how do we arrange the system so anyone can use those files without
>> negative permission causing problems.
>>
>>
>> I believe it is safe to not limit the owner of a file, as the
>> owner of a file can always chmode the file and remove any restrictions.
>> Which is no worse than calling setuid to a different uid.
>>
>> Which leaves where we have been dealing with the ability to drop groups
>> with setgroups.
>>
>> I guess the practical proposal is when the !in_group_p and we are
>> looking at the other permission. Treat the permissions as:
>> ((mode >> 3) & mode & 7). Instead of just (mode & 7).
>>
>> Which for systems who don't use negative group permissions is a no-op.
>> So this should not effect your btrfs snapshots at all (unless you use
>> negative group permissions).
>>
>> It denies things before we get to an NFS server or other interesting
>> case so it should work for pretty much everything the kernel deals with.
>>
>> Userspace repeating permission checks could break. But that is just a
>> problem of inconsistency, and will always be a problem.
>>
>> We could make it more precise as Serge was suggesting with a set of that
>> were dropped from setgroups, but under the assumption that negative
>> groups are sufficient rare we can avoid that overhead.
>
> I'm tempted to agree and say that it's safe to assume that they are used
> very much. Negative acls have been brought up a couple of times in
> related contexts though. One being a potential bug in newgidmap which we
> discussed back in
> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
> But I think if we have this under a sysctl as proposed earlier is good
> enough.
>
>>
>> static int acl_permission_check(struct inode *inode, int mask)
>> {
>> unsigned int mode = inode->i_mode;
>>
>> - [irrelevant bits of this function]
>>
>> /* Only RWX matters for group/other mode bits */
>> mask &= 7;
>>
>> /*
>> * Are the group permissions different from
>> * the other permissions in the bits we care
>> * about? Need to check group ownership if so.
>> */
>> if (mask & (mode ^ (mode >> 3))) {
>> if (in_group_p(inode->i_gid))
>> mode >>= 3;
>> + /* Use the most restrictive permissions? */
>> + else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>> + mode &= (mode >> 3);
>> }
>>
>> /* Bits in 'mode' clear that we require? */
>> return (mask & ~mode) ? -EACCES : 0;
>> }
>>
>> As I read posix_acl_permission all of the posix acls for groups are
>> positive permissions. So I think the only other code that would need to
>> be updated would be the filesystems that replace generic_permission with
>> something that doesn't call acl_permission check.
>>
>> Userspace could then activate this mode with:
>> echo "safely_allow" > /proc/<pid>/setgroups
>>
>> That looks very elegant and simple, and I don't think will cause
>> problems for anyone. It might even make sense to make that the default
>> mode when creating a new user namespace.
>>
>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>>
>> Does anyone see any problems with tweaking the permissions this way so
>> that we can always allow setgroups in a user namespace?
>
> This looks sane and simple. I would still think that making it opt-in
> for a few kernel releases might be preferable to just making it the new
> default. We can then revisit flipping the default. Advanced enough
> container runtimes will quickly pick up on this and can make it the
> default for their unprivileged containers if they want to.

I think we can even do a little bit better than what I proposed above.
The downside of my code is that negtative acls won't work in containers
even if they do today. (Not that I think negative acls are something to
encourage just that breaking them means we have to deal with the
question: "Does someone care?").

What we can very safely do is limit negative acls to filesystems that
are mounted in the same user namespace. Like the code below.

static int acl_permission_check(struct inode *inode, int mask)
{
unsigned int mode = inode->i_mode;

- [irrelevant bits of this function]

/* Only RWX matters for group/other mode bits */
mask &= 7;

/*
* Are the group permissions different from
* the other permissions in the bits we care
* about? Need to check group ownership if so.
*/
if (mask & (mode ^ (mode >> 3))) {
if (in_group_p(inode->i_gid))
mode >>= 3;
+ /*
+ * In a user namespace groups may have been dropped
+ * so use the most restrictive permissions.
+ */
+ else if (current->user_ns != inode->i_sb->user_ns)
+ mode &= (mode >> 3);
}

/* Bits in 'mode' clear that we require? */
return (mask & ~mode) ? -EACCES : 0;
}

I would make the plan that we apply the fully fleshed out version of the
above (aka updating the permission methods that don't use
generic_permission), and then in a following kernel cycle we remove the
restrictions on setgroups because they are no longer needed.

The only possible user breaking issue I can see if a system with
negative acls where the containers rely on having access to the other
permissions for some reason. If someone finds a system that does this
change would need to be reverted and another plan would need to be
found. Otherwise I think/hope this is a safe semantic change.

Does anyone see any problems with my further simplification?

Eric