Re: MNT_DETACH and mount namespace issue

From: Eric W. Biederman
Date: Mon Aug 04 2014 - 12:50:36 EST


Richard Weinberger <richard.weinberger@xxxxxxxxx> writes:

> On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Richard Weinberger <richard@xxxxxx> writes:
>>
>>> Am 01.08.2014 17:44, schrieb Ram Pai:
>>>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>>>> If we use the plain list_empty() we might not see the
>>>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>>>> list.
>>>>>>>
>>>>>>> It fixes the following issue:
>>>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>>>> $ mkdir -p foo/proc
>>>>>>> $ mount -t proc none foo/proc
>>>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>>>> $ umount -l foo/proc
>>>>>>> $ rmdir foo/proc
>>>>>>> rmdir: failed to remove âfoo/procâ: Device or resource busy
>>>>>>
>>>>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>>>>> time. Just was able to reproduce it on 2.6.32.
>>>>>> Please note that you need a shared root subtree to trigger the issue.
>>>>>> i.e. mount --shared /
>>>>>> Maybe this is why nobody noticed it so far as only systemd distros
>>>>>> have the root subtree shared by default.
>>>>>>
>>>>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>>>>> and then lazy umounts /proc.
>>>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>>>> service file has set "PrivateTmp=true". This setting creates
>>>>>> a mount namespace for the said service...
>>>>>>
>>>>>> In __propagate_umount() the following piece of code is interesting:
>>>>>>
>>>>>> /*
>>>>>> * umount the child only if the child has no
>>>>>> * other children
>>>>>> */
>>>>>> if (child && list_empty(&child->mnt_mounts)) {
>>>>>> hlist_del_init_rcu(&child->mnt_hash);
>>>>>> hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>>>> }
>>>>>>
>>>>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>>>>> subtree was removed.
>>>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>>>
>>>>> CC'ing Ram Pai.
>>>>>
>>>>> Ram, you are the author of the said code. Can you please explain why we need that
>>>>> list_empty() check?
>>>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>>>>
>>>> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
>>>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>>>> implicit vfsmounts have child-mounts than obviously they are busy, and
>>>> hence they cannot be lazy-unmounted implicitly.
>>>>
>>>> the list_empty() is checking for no child-mounts on the vfsmount before
>>>> letting it unmount.
>>>>
>>>> We did not want a bunch of mounts disappear without the users knowledge.
>>>> Hence we made the above rule.
>>>>
>>>> Al Viro, will have more insights into this.
>>>
>>> Hmm, with the root subtree shared by default this policy will be problematic and
>>> lead to problems.
>>> As I observe on openSUSE 13.1.
>>>
>>> Al, what do you think?
>>
>> I have a pending patchset that causes the rmdir to cause all of the
>> mounts to go away. It has passed review and has not been merged only
>> because of stack overflow concerns (which I have not had time to fully
>> address).
>>
>> Sigh. It badly breaks unix semantics for rmdir unlink with no mounts in
>> the local namespace to fail, and it introduces as denial of service
>> attack from unprivielged users.
>
> Thanks for the pointer!
> I fear your patch series is nothing we can easily feed into -stable.
> Is this really the only acceptable solution?

I am not really certain. What I know is one of these days I need to
take the time and get that merged. It isn't going to be in 3.17
unfortunately :(

What I do know is if you are asking questions about sane semantics my
patch and the approach it takes feeds in.

It sounds like your problem is with lazy recursive unmounts not being
propogated because there is a chance that in the destination there might
be something mounted on top.

> The thing is, users get already bitten by that.

The issue I am dealing with has been biting users for years.
as root rm -rf dir failing with EBUSY because of a stale process in
the mount namespace is pretty horrid.

> i.e. run openSUSE's KIWI tool on a machine where a systemd service
> with PrivateTmp=yes is installed
> and you'll end up with a stale mount point.
> KIWI creates a chroot, populates /proc, lazily unmounts it later and
> then fails to remove the temporary chroot directory
> because of EBUSY.

Hmm. That problem does sound familiar.

Is the problem oversharing? Is the problem that the mount of /proc in
the chroot directory is propogating into other mount namespaces that
don't care?

If the problem is over propogating I would argue that KIWI needs to
use a mount namespace instead of a chroot and to tweak the mount
namespace so the mount of /proc simply does not leak out.

Not that the kernel doesn't need to be fixed but that a design where
mounts propogate everywhere is a problem in userspace anyway, and it is
likely going to only need to be a handful of lines of code to fix
userspace cleanly. While the kernel fix or fixes are going to require a
bit more time.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/