Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

From: Michael Kerrisk (man-pages)
Date: Wed Aug 18 2021 - 20:23:03 EST


Hello Eric,

Thank you for your response.

On 8/17/21 5:51 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>
>> Hi Eric,
>>
>> Thanks for your feedback!
>>
>> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>>> Michael Kerrisk <mtk.manpages@xxxxxxxxx> writes:
>>>
>>>> For a long time, this manual page has had a brief discussion of
>>>> "locked" mounts, without clearly saying what this concept is, or
>>>> why it exists. Expand the discussion with an explanation of what
>>>> locked mounts are, why mounts are locked, and some examples of the
>>>> effect of locking.
>>>>
>>>> Thanks to Christian Brauner for a lot of help in understanding
>>>> these details.
>>>>
>>>> Reported-by: Christian Brauner <christian.brauner@xxxxxxxxxx>
>>>> Signed-off-by: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
>>>> ---
>>>>
>>>> Hello Eric and others,
>>>>
>>>> After some quite helpful info from Christian Brauner, I've expanded
>>>> the discussion of locked mounts (a concept I didn't really have a
>>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>>> grateful to receive review comments, acks, etc., on the patch below.
>>>> Could you take a look please?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>>> index e3468bdb7..97427c9ea 100644
>>>> --- a/man7/mount_namespaces.7
>>>> +++ b/man7/mount_namespaces.7
>>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>> mount namespace as a single unit,
>>>> and recursive mounts that propagate between
>>>> mount namespaces propagate as a single unit.)
>>>> +.IP
>>>> +In this context, "may not be separated" means that the mounts
>>>> +are locked so that they may not be individually unmounted.
>>>> +Consider the following example:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo mkdir /mnt/dir\fP
>>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
>>>
>>> Do we want a more motivating example such as a /proc/sys?
>>>
>>> It has been common to mount over /proc files and directories that can be
>>> written to by the global root so that users in a mount namespace may not
>>> touch them.
>>
>> Seems reasonable. But I want to check one thing. Can you please
>> define "global root". I'm pretty sure I know what you mean, but
>> I'd like to know your definition.
>
> I mean uid 0 in the initial user namespace.

(Good. That's what I thought you meant. So far, that term is not
described in the manual pages. I just now added a definition of the
term to user_namespaces(7).)
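
As a rough illustration of that definition, here is a minimal shell
sketch that checks whether the caller looks like "global root". It
assumes that a uid_map consisting of the single identity mapping
"0 0 4294967295" indicates the initial user namespace. That is only a
heuristic (a child user namespace can be given the same mapping), and
the function names are made up for this example.

```shell
# Heuristic check: does the uid_map in file $1 (default: our own map)
# consist of exactly one line mapping the full UID range onto itself?
is_initial_userns_map() {
    awk '$1 == 0 && $2 == 0 && $3 == 4294967295 { full = 1 }
         END { exit !(full && NR == 1) }' "${1:-/proc/self/uid_map}"
}

# "Global root" in the sense discussed above: UID 0 as seen by the
# caller, with a UID map that looks like the initial user namespace's.
is_global_root() {
    [ "$(id -u)" -eq 0 ] && is_initial_userns_map
}
```

For example, "is_global_root && echo yes" should print nothing in a
shell started with "unshare --user --map-root-user", since the UID map
there has the form "0 <outer-uid> 1".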

> This uid owns most of the files in /proc.
>
> Container systems that don't want to use user namespaces frequently
> mount over files in proc to prevent using some of the root privileges
> that come simply by having uid 0.
>
> Another use is mounting over files on virtual filesystems like proc
> to reduce the attack surface.

Thanks for the background. I think for the moment I will go with
Christian's alternative suggestion (an example using /etc/shadow).

> For reducing what the root user in a container can do, I think using user
> namespaces and using a uid other than 0 in the initial user namespace
> is the better approach.
>
>
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The above steps, performed in a more privileged user namespace,
>>>> +have created a (read-only) bind mount that
>>>> +obscures the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +For security reasons, it should not be possible to unmount
>>>> +that mount in a less privileged user namespace,
>>>> +since that would reveal the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +.IP
>>>> +Suppose we now create a new mount namespace
>>>> +owned by a (new) subordinate user namespace.
>>>> +The new mount namespace will inherit copies of all of the mounts
>>>> +from the previous mount namespace.
>>>> +However, those mounts will be locked because the new mount namespace
>>>> +is owned by a less privileged user namespace.
>>>> +Consequently, an attempt to unmount the mount fails:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>>> + \fBstrace \-o /tmp/log \e\fP
>>>> + \fBumount /mnt/dir\fP
>>>> +umount: /mnt/dir: not mounted.
>>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The error message from
>>>> +.BR umount (8)
>>>> +is a little confusing, but the
>>>> +.BR strace (1)
>>>> +output reveals that the underlying
>>>> +.BR umount2 (2)
>>>> +system call failed with the error
>>>> +.BR EINVAL ,
>>>> +which is the error that the kernel returns to indicate that
>>>> +the mount is locked.
>>>
>>> Do you want to mention that you can unmount the entire subtree? Either
>>> with pivot_root if it is locked to "/" or with
>>> "umount -l /path/to/propagated/directory".
>>
>> Yes, I wondered about that, but hadn't got round to devising
>> the scenario. How about this:
>>
>> [[
>> * Following on from the previous point, note that it is possible
>> to unmount an entire tree of mounts that propagated as a unit
> ^^^^^ subtree?

Yes, probably better, to prevent misunderstandings. Changed (and in a few
other places also).

>> into a mount namespace that is owned by a less privileged user
>> namespace, as illustrated in the following example.
>
>>
>> First, we create new user and mount namespaces using
>> unshare(1). In the new mount namespace, the propagation type
>> of all mounts is set to private. We then create a shared bind
>> mount at /mnt, and a small hierarchy of mount points underneath
>> that mount point.
>>
>> $ PS1='ns1# ' sudo unshare --user --map-root-user \
>> --mount --propagation private bash
>> ns1# echo $$ # We need the PID of this shell later
>> 778501
>> ns1# mount --make-shared --bind /mnt /mnt
>> ns1# mkdir /mnt/x
>> ns1# mount --make-private -t tmpfs none /mnt/x
>> ns1# mkdir /mnt/x/y
>> ns1# mount --make-private -t tmpfs none /mnt/x/y
>> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>>
>> Continuing in the same shell session, we then create a second
>> shell in a new mount namespace and a new subordinate (and thus
>> less privileged) user namespace and check the state of the
>> propagated mount points rooted at /mnt.
>>
>> ns1# PS1='ns2# ' unshare --user --map-root-user \
>> --mount --propagation unchanged bash
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>>
>> Of note in the above output is that the propagation type of the
>> mount point /mnt has been reduced to slave, as explained near
>> the start of this subsection. This means that submount events
>> will propagate from the master /mnt in "ns1", but propagation
>> will not occur in the opposite direction.
>>
>> From a separate terminal window, we then use nsenter(1) to
>> enter the mount and user namespaces corresponding to "ns1". In
>> that terminal window, we then recursively bind mount /mnt/x at
>> the location /mnt/ppp.
>>
>> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>> ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>> 1242 986 0:56 / /mnt/ppp rw,relatime
>> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>>
>> Because the propagation type of the parent mount, /mnt, was
>> shared, the recursive bind mount propagated a small tree of
>> mounts under the slave mount /mnt into "ns2", as can be
>> verified by executing the following command in that shell
>> session:
>>
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> 1244 1239 0:56 / /mnt/ppp rw,relatime
>> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>>
>> While it is not possible to unmount a part of that propagated
>> subtree (/mnt/ppp/y), it is possible to unmount the entire
>> tree, as shown by the following commands:
>>
>> ns2# umount /mnt/ppp/y
>> umount: /mnt/ppp/y: not mounted.
>> ns2# umount -l /mnt/ppp # Succeeds...
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> ]]
>>
>> ?
>
> Yes.
>
> It is worth noting that in ns2 it is also possible to mount on top of
> /mnt/ppp/y and umount from /mnt/ppp/y.

Yes, good point. I've added some text, and an example for that case.
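
For reading mountinfo listings like the ones quoted above, a small
helper can make the propagation state of each mount explicit. The
following is only a sketch: the function name is invented here, and it
assumes the field layout documented in proc(5), where the optional
fields (such as "shared:N" and "master:N") start at field 7 and end at
the "-" separator. A mount with no optional fields is reported as
"private".

```shell
# Read mountinfo-format lines on stdin and print, for each line, the
# mount point followed by its propagation tags (or "private" if none).
# Also accepts lines already trimmed with sed 's/ - .*//'.
show_propagation() {
    awk '{
        tags = ""
        # Collect optional fields up to (not including) the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags == "" ? "" : ",") $i
        print $5, (tags == "" ? "private" : tags)
    }'
}
```

Typical use would be "grep /mnt /proc/self/mountinfo | show_propagation".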

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/