Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
From: Laurent Vivier
Date: Mon Dec 16 2019 - 15:05:33 EST
Le 16/12/2019 Ã 20:08, Jann Horn a ÃcritÂ:
> On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier <laurent@xxxxxxxxx> wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>
> How do you ensure that when userspace is no longer using the user
> namespace and mount namespace, the entries and the binfmt_misc
> superblock are deleted? As far as I can tell from looking at the code,
> at the moment, if I create a user namespace+mount namespace, mount
> binfmt_misc in there, register a file format and then let all
> processes inside the namespaces exit, the binfmt_misc mount will be
> kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
> will also stay in memory.
>
> [...]
Do you have any idea how I can fix this issue?
>> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>> if (!inode)
>> goto out2;
>>
>> - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
>> + ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> + err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
>> + &ns->entry_count);
>
> When you call simple_pin_fs() here, the user namespace of `current`
> and the user namespace of the superblock are not necessarily related.
> So simple_pin_fs() may end up taking a reference on the mountpoint for
> a user namespace that has nothing to do with the namespace for which
> an entry is being created.
Do you have any idea how I can fix this issue?
>
> [...]
>> static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>> {
>> int err;
>> + struct user_namespace *ns = sb->s_user_ns;
> [...]
>> + /* create a new binfmt namespace
>> + * if we are not in the first user namespace
>> + * but the binfmt namespace is the first one
>> + */
>> + if (READ_ONCE(ns->binfmt_ns) == NULL) {
>
> The READ_ONCE() here is unnecessary, right? AFAIK the VFS layer is
> going to ensure that bm_fill_super() can't run concurrently for the
> same namespace?
So I understand the "READ_ONCE()" is unnecessary and I will remove it.
>
>> + struct binfmt_namespace *new_ns;
>> +
>> + new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> + GFP_KERNEL);
>> + if (new_ns == NULL)
>> + return -ENOMEM;
>> + INIT_LIST_HEAD(&new_ns->entries);
>> + new_ns->enabled = 1;
>> + rwlock_init(&new_ns->entries_lock);
>> + new_ns->bm_mnt = NULL;
>> + new_ns->entry_count = 0;
>> + /* ensure new_ns is completely initialized before sharing it */
>> + smp_wmb();
>> + WRITE_ONCE(ns->binfmt_ns, new_ns);
>
> Nit: This would be a little bit semantically clearer if you used
> smp_store_release() instead of smp_wmb()+WRITE_ONCE().
I will.
>
>> + }
>> +
>> err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
> [...]
>> +static void bm_free(struct fs_context *fc)
>> +{
>> + if (fc->s_fs_info)
>> + put_user_ns(fc->s_fs_info);
>> +}
>
> Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?
So I understand the if is unnecessary and I will remove it.
>
>> +
>> static int bm_get_tree(struct fs_context *fc)
>> {
>> - return get_tree_single(fc, bm_fill_super);
>> + return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>
> get_user_ns() increments the refcount of the namespace, but in the
> case where a binfmt_misc mount already exists, that refcount is never
> dropped, right? That would be a security bug, since an attacker could
> overflow the refcount of the user namespace and then trigger a UAF.
> (And the refcount hardening won't catch it because user namespaces
> still use raw atomics instead of refcount_t.)
Do you have any idea how I can fix this issue?
> [...]
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>
> Nit: Isn't this kind of check normally written as "#ifdef"?
>
What is the difference?
Thanks,
Laurent