Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

From: Andy Lutomirski
Date: Tue Apr 29 2014 - 20:13:16 EST


On Tue, Apr 29, 2014 at 5:10 PM, Marian Marinov <mm@xxxxxx> wrote:
> On 04/30/2014 03:01 AM, StÃphane Graber wrote:
>>
>> On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
>>>
>>> On Tue, Apr 29, 2014 at 4:47 PM, StÃphane Graber <stgraber@xxxxxxxxxx>
>>> wrote:
>>>>
>>>> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>>>>>
>>>>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <mm@xxxxxx> wrote:
>>>>>>
>>>>>> On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting Marian Marinov (mm-108MBtLGafw@xxxxxxxxxxxxxxxx):
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quoting Marian Marinov (mm-108MBtLGafw@xxxxxxxxxxxxxxxx):
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Quoting Theodore Ts'o (tytso-3s7WtUTddSA@xxxxxxxxxxxxxxxx):
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm proposing a fix to this, by replacing the
>>>>>>>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>>>>>>>>>>>>>> check with ns_capable(current_cred()->user_ns,
>>>>>>>>>>>>>> CAP_LINUX_IMMUTABLE).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>>>>>>>>
>>>>>>>>>>>>> /**
>>>>>>>>>>>>> * capable - Determine if the current task has a superior
>>>>>>>>>>>>> capability in effect
>>>>>>>>>>>>> * @cap: The capability to be tested for
>>>>>>>>>>>>> *
>>>>>>>>>>>>> * Return true if the current task has the given superior
>>>>>>>>>>>>> capability currently
>>>>>>>>>>>>> * available for use, false if not.
>>>>>>>>>>>>> *
>>>>>>>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>>>>>>>>>>>>> available on the
>>>>>>>>>>>>> * assumption that it's about to be used.
>>>>>>>>>>>>> */
>>>>>>>>>>>>> bool capable(int cap)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> EXPORT_SYMBOL(capable);
>>>>>>>>>>>>>
>>>>>>>>>>>>> The documentation states that it is for "the current task", and
>>>>>>>>>>>>> I
>>>>>>>>>>>>> can't imagine any use case, where user namespaces are in
>>>>>>>>>>>>> effect,
>>>>>>>>>>>>> where
>>>>>>>>>>>>> using init_user_ns would ever make sense.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> the init_user_ns represents the user_ns owning the object, not
>>>>>>>>>>>> the
>>>>>>>>>>>> subject.
>>>>>>>>>>>>
>>>>>>>>>>>> The patch by Marian is wrong. Anyone can do
>>>>>>>>>>>> 'clone(CLONE_NEWUSER)',
>>>>>>>>>>>> setuid(0), execve, and end up satisfying
>>>>>>>>>>>> 'ns_capable(current_cred()->userns,
>>>>>>>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>>>>>>>>
>>>>>>>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it
>>>>>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>> safe to check against the userns owning the inode?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So what you are proposing is to replace
>>>>>>>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>>>>>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>>>>>>>>
>>>>>>>>>>> I agree that this is more sane.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Right, and I think the two operations you're looking at seem sane
>>>>>>>>>> to allow.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you are ok with this patch, I will fix all file systems and send
>>>>>>>>> patches.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sounds good, thanks.
>>>>>>>>
>>>>>>>>> Signed-off-by: Marian Marinov <mm-NV7Lj0SOnH0@xxxxxxxxxxxxxxxx>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Acked-by: Serge E. Hallyn
>>>>>>>> <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@xxxxxxxxxxxxxxxx>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Wait, what?
>>>>>>>
>>>>>>> Inodes aren't owned by user namespaces; they're owned by users. And
>>>>>>> any
>>>>>>> user can arrange to have a user namespace in which they pass an
>>>>>>> inode_capable check on any inode that they own.
>>>>>>>
>>>>>>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If
>>>>>>> this
>>>>>>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>>>>>>> entirely.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The problem I'm trying to solve is this:
>>>>>>
>>>>>> container with its own user namespace and CAP_SYS_IMMUTABLE should be
>>>>>> able
>>>>>> to use chattr on all files witch this container has access to.
>>>>>>
>>>>>> Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not
>>>>>> working.
>>>>>>
>>>>>> With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
>>>>>> container.
>>>>>>
>>>>>> The first solution got its user namespace from the currently running
>>>>>> process
>>>>>> and the second gets its user namespace from the currently opened
>>>>>> inode.
>>>>>>
>>>>>> So what would be the best solution in this case?
>>>>>
>>>>>
>>>>> I'd suggest adding a mount option like fs_owner_uid that names a uid
>>>>> that owns, in the sense of having unlimited access to, a filesystem.
>>>>> Then anyone with caps on a namespace owned by that uid could do
>>>>> whatever.
>>>>>
>>>>> Eric?
>>>>>
>>>>> --Andy
>>>>
>>>>
>>>> The most obvious problem I can think of with "do whatever" is that this
>>>> will likely include mknod of char and block devices which you can then
>>>> chown/chmod as you wish and use to access any devices on the system from
>>>> an unprivileged container.
>>>> This can however be mitigated by using the devices cgroup controller.
>>>
>>>
>>> Or 'nodev'. setuid/setgid may have the same problem, too.
>>>
>>> Implementing something like this would also make CAP_DAC_READ_SEARCH
>>> and CAP_DAC_OVERRIDE work.
>>>
>>> Arguably it should be impossible to mount such a thing in the first
>>> place without global privilege.
>>>
>>>>
>>>> You also probably wouldn't want any unprivileged user from the host to
>>>> find a way to access that mounted filesytem but so long as you do the
>>>> mount in a separate mountns and don't share uids between the host and
>>>> the container, that should be fine too.
>>>
>>>
>>> This part should be a nonissue -- an unprivileged user who has the
>>> right uid owns the namespace anyway, so this is the least of your
>>> worries.
>>>
>>> --Andy
>>
>>
>> It should be a nonissue so long as we make sure that a file owned by a
>> uid outside the scope of the container may not be changed even though
>> fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
>> a shell and anyone who can see the fs from the host will be getting a
>> root shell (assuming said file is owned by the host's uid 0).
>>
>> So that's restricting slightly what "do whatever" would do in this case.
>>
>
> In my case I give an LVM volume to each container and limit the container to
> only this block device using the devices cgroup.
> So the inode_capable() fix worked like a charm for me.
> The container can not see any filesystem other then its own.
> And I have another patch for my kernel that prohibits setns from cgroup
> other then / which prevents programs from the container from getting out.
> clone() can be used to create new namespaces but can not be used to attach
> to already created namespaces.

Doesn't matter -- the risk here is that an attacker outside the
namespace can get an fd that points to a directory in the namespace.
SCM_RIGHTS would be the major vector.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/