Re: Questions re the new mount_setattr(2) manual page

From: Michael Kerrisk (man-pages)
Date: Thu Aug 12 2021 - 21:25:11 EST

Next message: Joel Stanley: "Re: [PATCH v2 4/6] ARM: dts: aspeed: Add Facebook Cloudripper (AST2600) BMC"
Previous message: Stephen Rothwell: "linux-next: manual merge of the net-next tree with the net tree"
In reply to: Christian Brauner: "Re: Questions re the new mount_setattr(2) manual page"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello Christian,

On 8/12/21 10:38 AM, Christian Brauner wrote:
> On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
>> [CC += Eric, in case he has a comment on the last piece]

[...]

>>> That's really splitting hairs.
>>
>> To be clear, I'm not trying to split hairs :-). It's just that
>> I'm struggling a little to understand. (In particular, the notion
>> of locked mounts is one where my understanding is weak.)
>>
>> And think of it like this: I am the first line of defense for the
>> user-space reader. If I am having trouble understanding the text,
>> I won't be alone. And often, the problem is not so much that the
>> text is "wrong", it's that there's a difference in background
>> knowledge between what you know and what the reader (in this case
>> me) knows. Part of my task is to fill that gap, by adding info
>> that I think is necessary to the page (with the happy side
>> effect that I learn along the way.)
>
> All very good points.
> I didn't mean to complain btw. Sorry that it seemed that way. :)

No problem. I need to think more carefully about my words
sometimes in mails too :-)

>>> Of course this means that we're
>>> propagating into a mount namespace that is owned by a different user
>>> namespace though "crossing user namespaces" might have been the better
>>> choice.
>>
>> This is a perfect example of the point I make above. You say "of course",
>> but I don't have the background knowledge that you do :-). From my
>> perspective, I want to make sure that I understand your meaning, so
>> that that meaning can (IMHO) be made easier for the average reader
>> of the manual page.
>>
>>>> the aforementioned flags to protect these sensitive
>>>> properties from being altered.
>>>>
>>>> • A new mount and user namespace pair is created. This
>>>> happens for example when specifying CLONE_NEWUSER |
>>>> CLONE_NEWNS in unshare(2), clone(2), or clone3(2). The
>>>> aforementioned flags become locked to protect user
>>>> namespaces from altering sensitive mount properties.
>>>>
>>>> Again, this seems imprecise. Should it say something like:
>>>> "... to prevent changes to sensitive mount properties in the new
>>>> mount namespace" ? Or perhaps you have a better wording.
>>>
>>> That's not imprecise.
>>
>> Okay -- poor choice of wording on my part:
>>
>> s/this seems imprecise/I'm having trouble understanding this/
>>
>>> What you want to protect against is altering
>>> sensitive mount properties from within a user namespace irrespective of
>>> whether or not the user namespace actually owns the mount namespace,
>>> i.e. even if you own the mount namespace you shouldn't be able to alter
>>> those properties. I concede though that "protect" should've been
>>> "prevent".
>>
>> Can I check my education here please. The point is this:
>>
>> * The mount point was created in a mount NS that was owned by
>> a more privileged user NS (e.g., the initial user NS).
>> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and)
>> mount NS.
>> * In the new mount NS, the mounts become locked.
>>
>> And, help me here: is it correct that the reason the properties
>> need to be locked is because they are shared between the mounts?
>
> Yes, basically.

Yes, but that last sentence of mine was wrong, wasn't it? The
properties are not actually shared between the mounts, right?
(Earlier, I had done an experiment which misled me into thinking
there was sharing, but now it looks to me like there is not.)

> The new mount namespace contains a copy of all the mounts in the
> previous mount namespace. So they are separate mounts which you can best
> see when you do unshare --mount --propagation=private. An unmount in the
> new mount namespace won't affect the mount in the previous mount
> namespace. Which can only nicely work if they are separate mounts.
> Propagation relies (among other things) on the fact that mount
> namespaces have copies of the mounts.
>
> The copied mounts in the new mount namespace will have inherited all
> properties they had at the time when copy_namespaces() and specifically
> copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
> into the appropriately named clone_mnt(). This is the low-level routine
> that is responsible for cloning the mounts including their mount
> properties.
>
> Some mount properties such as read-only, nodev, noexec, nosuid, atime -
> while arguably not per se security mechanisms - are used for protection
> or as security measures in userspace applications. The most obvious one
> might be the read-only property. One wouldn't want to expose a set of
> files as read-only only for someone else to trivially gain write access
> to them. An example of where that could happen is when creating a new
> mount namespace and user namespace pair where the new mount namespace
> is owned by the new user namespace in which the caller is privileged and
> thus the caller would also be able to alter the new mount namespace. So
> without locking flags all it would take to turn a read-only into a
> read-write mount is:
> unshare -U --map-root --mount --propagation=private -- mount -o remount,rw /some/mnt
> Locking such flags prevents that from happening.

Thanks for the detailed explanation; it's very helpful.
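
To make the rule concrete, here is a toy model in Python of the check
that a remount has to pass. It is only a sketch: the set-based
representation and the names LOCKABLE, Mount, and remount are made up
for illustration and are not kernel code, and the atime modes are
collapsed into a single "noatime" label for simplicity. The assumption
is that a locked flag may stay set but may not be cleared.

```python
import errno
from dataclasses import dataclass, field

# Flags named in the explanation above (hypothetical string labels).
LOCKABLE = {"ro", "nodev", "noexec", "nosuid", "noatime"}


@dataclass
class Mount:
    flags: set = field(default_factory=set)
    locked: set = field(default_factory=set)

    def copy_into_new_userns(self):
        # Copy the mount and lock every sensitive flag that is set.
        return Mount(set(self.flags), self.flags & LOCKABLE)


def remount(mnt, new_flags):
    # Refuse to clear a locked flag; any other change is allowed.
    dropped = mnt.locked - set(new_flags)
    if dropped:
        raise PermissionError(
            errno.EPERM, "locked flags: " + ", ".join(sorted(dropped))
        )
    mnt.flags = set(new_flags)
```

In this model, a copy of a read-only mount made for a new user
namespace rejects remount(copy, set()) with EPERM, while the original
mount can still be remounted read-write.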

>>> You could probably say:
>>>
>>> A new mount and user namespace pair is created. This
>>> happens for example when specifying CLONE_NEWUSER |
>>> CLONE_NEWNS in unshare(2), clone(2), or clone3(2).
>>> The aforementioned flags become locked in the new mount
>>> namespace to prevent sensitive mount properties from being
>>> altered.
>>> Since the newly created mount namespace will be owned by the
>>> newly created user namespace a caller privileged in the newly
>>> created user namespace would be able to alter sensitive
>>> mount properties. For example, without locking the read-only
>>> property for the mounts in the new mount namespace such a caller
>>> would be able to remount them read-write.
>>
>> So, I've now made the text:
>>
>> EPERM One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>> MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>> MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>> locked. Mount attributes become locked on a mount if:
>>
>> • A new mount or mount tree is created causing mount
>> propagation across user namespaces (i.e., propagation to
>> a mount namespace owned by a different user namespace).
>> The kernel will lock the aforementioned flags to prevent
>> these sensitive properties from being altered.
>>
>> • A new mount and user namespace pair is created. This
>> happens for example when specifying CLONE_NEWUSER |
>> CLONE_NEWNS in unshare(2), clone(2), or clone3(2). The
>> aforementioned flags become locked in the new mount
>> namespace to prevent sensitive mount properties from
>> being altered. Since the newly created mount namespace
>> will be owned by the newly created user namespace, a
>> calling process that is privileged in the new user
>> namespace would—in the absence of such locking—be able
>> to alter sensitive mount properties (e.g., to remount a
>> mount that was marked read-only as read-write in the new
>> mount namespace).
>>
>> Okay?
>
> Sounds good.

Okay.
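
For readers who want to trigger that EPERM themselves, the following
Python sketch shows one way to ask mount_setattr(2) to clear the
read-only attribute. The helper names pack_mount_attr and clear_rdonly
are hypothetical. The syscall number 442 is an assumption that holds
for x86-64; please check the number for your architecture. The
constant values are taken from <linux/mount.h> and <fcntl.h>.

```python
import ctypes
import errno
import os
import struct

MOUNT_ATTR_RDONLY = 0x00000001  # from <linux/mount.h>
AT_FDCWD = -100                 # from <fcntl.h>
SYS_mount_setattr = 442         # assumption: x86-64; check your architecture


def pack_mount_attr(attr_set=0, attr_clr=0, propagation=0, userns_fd=0):
    # struct mount_attr consists of four 64-bit fields, in this order.
    return struct.pack("=QQQQ", attr_set, attr_clr, propagation, userns_fd)


def clear_rdonly(path):
    # Ask the kernel to clear the read-only attribute on the mount at path.
    attr = pack_mount_attr(attr_clr=MOUNT_ATTR_RDONLY)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(
        ctypes.c_long(SYS_mount_setattr),
        ctypes.c_int(AT_FDCWD),
        ctypes.c_char_p(path.encode()),
        ctypes.c_uint(0),
        ctypes.c_char_p(attr),
        ctypes.c_size_t(len(attr)),
    )
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
```

Going by the text above, calling clear_rdonly() on a mount whose
read-only flag is locked should raise an OSError whose errno is
errno.EPERM.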

>>> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>>> A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
>>> kick in. For all mounts not marked as expired MNT_LOCKED will be set
>>> which means that a umount() on any such mount copied from the previous
>>> mount namespace will yield EINVAL implying from userspace's perspective
>>> it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>>> - whereas a remount to alter a locked flag will yield EPERM.)
>>
>> Thanks for educating me! So, is that what we are seeing below?

(Was your silence to the above question an implicit "yes"?)

>> $ sudo umount /mnt/m1
>> $ sudo mount -t tmpfs none /mnt/m1
>> $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
>> umount: /mnt/m1: not mounted.
>> $ grep ^umount /tmp/log
>> umount2("/mnt/m1", 0) = -1 EINVAL (Invalid argument)
>>
>> The mount_namespaces(7) page has for a long time had this text:
>>
>> * Mounts that come as a single unit from a more privileged mount
>> namespace are locked together and may not be separated in a
>> less privileged mount namespace. (The unshare(2) CLONE_NEWNS
>> operation brings across all of the mounts from the original
>> mount namespace as a single unit, and recursive mounts that
>> propagate between mount namespaces propagate as a single unit.)
>>
>> I have had trouble understanding that. But maybe you just helped.
>> Is that text relevant to what you just wrote above? In particular,
>> I have trouble understanding what "separated" means. But, perhaps
>
> The text gives the "how" not the "why".

Yes, that's a big problem :-}.

> Consider a more elaborate mount tree where e.g., you have bind-mounted a
> mount over a subdirectory of another mount:
>
> sudo mount -t tmpfs tmpfs /mnt
> sudo mkdir /mnt/my-dir/
> sudo touch /mnt/my-dir/my-file
> sudo mount --bind /opt /mnt/my-dir
>
> The files underneath /mnt/my-dir are now hidden. Consider what would
> happen if one were allowed to address those mounts separately. A user
> could then do:
>
> unshare -U --map-root --mount
> umount /mnt/my-dir
> cat /mnt/my-dir/my-file
>
> giving them access to what's in my-dir.
>
> Treating such mount trees as a unit in less privileged mount namespaces
> (cf. [1]) prevents that, i.e., prevents revealing files and directories
> that were overmounted.

Got it!
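
The hiding effect can be sketched with a toy model in Python. This is
purely illustrative: the Namespace class and its methods are made-up
names, files are represented by the absolute paths under which a mount
would expose them, and the only kernel behavior assumed is the one
described earlier in this thread (umount() of a locked mount fails
with EINVAL).

```python
import errno


class Namespace:
    def __init__(self):
        # Entries are (mountpoint, files, locked); later entries were
        # mounted later.
        self.mounts = []

    def mount(self, mountpoint, files, locked=False):
        self.mounts.append((mountpoint.rstrip("/") or "/", set(files), locked))

    def umount(self, mountpoint):
        # Remove the most recent mount at mountpoint unless it is locked.
        for i in range(len(self.mounts) - 1, -1, -1):
            mp, _, locked = self.mounts[i]
            if mp == mountpoint:
                if locked:
                    raise OSError(errno.EINVAL, "mount is locked")
                del self.mounts[i]
                return
        raise OSError(errno.EINVAL, "not mounted")

    def visible(self, path):
        # The mount with the longest matching mountpoint serves the
        # path; on a tie the most recent mount wins.
        best = None
        for mp, files, _ in self.mounts:
            if path == mp or path.startswith(mp.rstrip("/") + "/"):
                if best is None or len(mp) >= len(best[0]):
                    best = (mp, files)
        return best is not None and path in best[1]
```

With /mnt exposing /mnt/my-dir/my-file and a second mount on
/mnt/my-dir, visible("/mnt/my-dir/my-file") is false until the second
mount is removed. If the mounts are created with locked=True, the
umount() call fails and the file stays hidden.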

> Treating such mounts as a unit is also relevant when e.g. bind-mounting
> a mount tree containing locked mounts. Sticking with the example above:
>
> unshare -U --map-root --mount
>
> # non-recursive bind-mount will fail
> mount --bind /mnt /tmp
>
> # recursive bind-mount will succeed
> mount --rbind /mnt /tmp
>
> The reason is again that the mount tree at /mnt is treated as a mount
> unit because it is locked. Allowing a non-recursive bind-mount of /mnt
> somewhere would reveal what's underneath the mount at my-dir. (This is
> in some sense the inverse of preventing a filesystem from being
> mounted that isn't fully visible, i.e. contains hidden or over-mounted
> mounts.)

Got it!
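
The bind-mount rule can be written down as a small Python sketch. The
function name bind and the dictionary representation are hypothetical,
and the errno used here (EINVAL) is an assumption: the explanation
above only says that the non-recursive bind-mount fails.

```python
import errno


def bind(mounts, src, recursive):
    """Return the mountpoints a bind-mount of src would copy.

    mounts maps each mountpoint to True if that mount is locked.
    """
    prefix = src.rstrip("/") + "/"
    children = [mp for mp in mounts if mp.startswith(prefix)]
    # A non-recursive bind would leave locked children behind and so
    # reveal whatever they cover; refuse it.
    if not recursive and any(mounts[mp] for mp in children):
        raise OSError(errno.EINVAL, "source has locked child mounts")
    return [src] + (sorted(children) if recursive else [])
```

With mounts = {"/mnt": True, "/mnt/my-dir": True}, the recursive case
copies both mounts, whereas bind(mounts, "/mnt", recursive=False)
raises OSError.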

> These semantics, in addition to being security relevant, also allow a
> more privileged mount namespace to create a restricted view of the
> filesystem hierarchy that can't be circumvented in a less privileged
> mount namespace. (Otherwise pivot_root would have to be used, which can
> also be used to guarantee a restricted view on the filesystem hierarchy,
> especially when combined with a separate rootfs.)

Okay.

Christian, thanks for so generously taking the time to write this up.
It really helped me a lot! I will do some work on the mount namespaces
manual page, to cover at least part of what you said.

Thanks,

Michael

> Christian
>
> [1]: I'll avoid jumping through the hoops of speaking about ownership
> all the time now for the sake of brevity. Otherwise I'll still sit
> here at lunchtime.
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
