Re: [PATCH RESEND v11 2/8] proc: allow to mount many instances of proc in one pid namespace

From: Eric W. Biederman
Date: Fri Apr 17 2020 - 14:58:18 EST


Alexey Gladkov <gladkov.alexey@xxxxxxxxx> writes:

> This patch allows to have multiple procfs instances inside the
> same pid namespace. The aim here is lightweight sandboxes, and to allow
> that we have to modernize procfs internals.
>
> 1) The main aim of this work is to have on embedded systems one
> supervisor for apps. Right now we have some lightweight sandbox support,
> however if we create pid namespacess we have to manages all the
> processes inside too, where our goal is to be able to run a bunch of
> apps each one inside its own mount namespace without being able to
> notice each other. We only want to use mount namespaces, and we want
> procfs to behave more like a real mount point.
>
> 2) Linux Security Modules have multiple ptrace paths inside some
> subsystems, however inside procfs, the implementation does not guarantee
> that the ptrace() check which triggers the security_ptrace_check() hook
> will always run. We have the 'hidepid' mount option that can be used to
> force the ptrace_may_access() check inside has_pid_permissions() to run.
> The problem is that 'hidepid' is per pid namespace and not attached to
> the mount point, any remount or modification of 'hidepid' will propagate
> to all other procfs mounts.
>
> This also does not allow to support Yama LSM easily in desktop and user
> sessions. Yama ptrace scope which restricts ptrace and some other
> syscalls to be allowed only on inferiors, can be updated to have a
> per-task context, where the context will be inherited during fork(),
> clone() and preserved across execve(). If we support multiple private
> procfs instances, then we may force the ptrace_may_access() on
> /proc/<pids>/ to always run inside that new procfs instances. This will
> allow to specifiy on user sessions if we should populate procfs with
> pids that the user can ptrace or not.
>
> By using Yama ptrace scope, some restricted users will only be able to see
> inferiors inside /proc, they won't even be able to see their other
> processes. Some software like Chromium, Firefox's crash handler, Wine
> and others are already using Yama to restrict which processes can be
> ptracable. With this change this will give the possibility to restrict
> /proc/<pids>/ but more importantly this will give desktop users a
> generic and usuable way to specifiy which users should see all processes
> and which users can not.
>
> Side notes:
> * This covers the lack of seccomp where it is not able to parse
> arguments, it is easy to install a seccomp filter on direct syscalls
> that operate on pids, however /proc/<pid>/ is a Linux ABI using
> filesystem syscalls. With this change LSMs should be able to analyze
> open/read/write/close...
>
> In the new patchset version I removed the 'newinstance' option
> as suggested by Eric W. Biederman.

Some very small requests.

1) Can you please not place fs_info in fs_context, and instead allocate
fs_info in fill_super? Unless I have misread introduced a resource
leak if proc is not mounted or if proc is simply reconfigured.

2) Can you please move hide_pid and pid_gid into fs_info in this patch?
As was shown by my recent bug fix

3) Can you please rebase on on v5.7-rc1 or v5.7-rc2 and repost these
patches please? I thought I could do it safely but between my bug
fixes, and Alexey Dobriyan's parallel changes to proc these patches
do not apply cleanly.

Plus there is a resource leak in this patch.

Eric


> Signed-off-by: Alexey Gladkov <gladkov.alexey@xxxxxxxxx>
> Reviewed-by: Alexey Dobriyan <adobriyan@xxxxxxxxx>
> Reviewed-by: Kees Cook <keescook@xxxxxxxxxxxx>
> ---
> fs/proc/base.c | 13 +++++++----
> fs/proc/inode.c | 4 ++--
> fs/proc/root.c | 42 ++++++++++++++++++++++-------------
> fs/proc/self.c | 6 ++---
> fs/proc/thread_self.c | 6 ++---
> include/linux/pid_namespace.h | 4 ----
> include/linux/proc_fs.h | 12 ++++++++++
> 7 files changed, 55 insertions(+), 32 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 74f948a6b621..3b9155a69ade 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3301,6 +3301,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
> {
> struct task_struct *task;
> unsigned tgid;
> + struct proc_fs_info *fs_info;
> struct pid_namespace *ns;
> struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3308,7 +3309,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
> if (tgid == ~0U)
> goto out;
>
> - ns = dentry->d_sb->s_fs_info;
> + fs_info = proc_sb_info(dentry->d_sb);
> + ns = fs_info->pid_ns;
> rcu_read_lock();
> task = find_task_by_pid_ns(tgid, ns);
> if (task)
> @@ -3372,6 +3374,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
> int proc_pid_readdir(struct file *file, struct dir_context *ctx)
> {
> struct tgid_iter iter;
> + struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
> struct pid_namespace *ns = proc_pid_ns(file_inode(file));
> loff_t pos = ctx->pos;
>
> @@ -3379,13 +3382,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
> return 0;
>
> if (pos == TGID_OFFSET - 2) {
> - struct inode *inode = d_inode(ns->proc_self);
> + struct inode *inode = d_inode(fs_info->proc_self);
> if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
> return 0;
> ctx->pos = pos = pos + 1;
> }
> if (pos == TGID_OFFSET - 1) {
> - struct inode *inode = d_inode(ns->proc_thread_self);
> + struct inode *inode = d_inode(fs_info->proc_thread_self);
> if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
> return 0;
> ctx->pos = pos = pos + 1;
> @@ -3599,6 +3602,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
> struct task_struct *task;
> struct task_struct *leader = get_proc_task(dir);
> unsigned tid;
> + struct proc_fs_info *fs_info;
> struct pid_namespace *ns;
> struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3609,7 +3613,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
> if (tid == ~0U)
> goto out;
>
> - ns = dentry->d_sb->s_fs_info;
> + fs_info = proc_sb_info(dentry->d_sb);
> + ns = fs_info->pid_ns;
> rcu_read_lock();
> task = find_task_by_pid_ns(tid, ns);
> if (task)
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 1e730ea1dcd6..6e4c6728338b 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -167,8 +167,8 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
>
> static int proc_show_options(struct seq_file *seq, struct dentry *root)
> {
> - struct super_block *sb = root->d_sb;
> - struct pid_namespace *pid = sb->s_fs_info;
> + struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
> + struct pid_namespace *pid = fs_info->pid_ns;
>
> if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
> seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 2633f10446c3..b28adbb0b937 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -30,7 +30,7 @@
> #include "internal.h"
>
> struct proc_fs_context {
> - struct pid_namespace *pid_ns;
> + struct proc_fs_info *fs_info;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please don't do this. As best as I can tell that introduces a memory
leak of proc is not mounted. Please allocate fs_info in

> unsigned int mask;
> int hidepid;
> int gid;
> @@ -92,7 +92,8 @@ static void proc_apply_options(struct super_block *s,
>
> static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> {
> - struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
> + struct proc_fs_context *ctx = fc->fs_private;
> + struct pid_namespace *pid_ns = get_pid_ns(ctx->fs_info->pid_ns);
> struct inode *root_inode;
> int ret;
>
> @@ -106,6 +107,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> s->s_magic = PROC_SUPER_MAGIC;
> s->s_op = &proc_sops;
> s->s_time_gran = 1;
> + s->s_fs_info = ctx->fs_info;
>
> /*
> * procfs isn't actually a stacking filesystem; however, there is
> @@ -113,7 +115,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> * top of it
> */
> s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> -
> +
> /* procfs dentries and inodes don't require IO to create */
> s->s_shrink.seeks = 0;
>
> @@ -140,7 +142,8 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> static int proc_reconfigure(struct fs_context *fc)
> {
> struct super_block *sb = fc->root->d_sb;
> - struct pid_namespace *pid = sb->s_fs_info;
> + struct proc_fs_info *fs_info = proc_sb_info(sb);
> + struct pid_namespace *pid = fs_info->pid_ns;
>
> sync_filesystem(sb);
>
> @@ -150,16 +153,14 @@ static int proc_reconfigure(struct fs_context *fc)
>
> static int proc_get_tree(struct fs_context *fc)
> {
> - struct proc_fs_context *ctx = fc->fs_private;
> -
> - return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
> + return get_tree_nodev(fc, proc_fill_super);
> }
>
> static void proc_fs_context_free(struct fs_context *fc)
> {
> struct proc_fs_context *ctx = fc->fs_private;
>
> - put_pid_ns(ctx->pid_ns);
> + put_pid_ns(ctx->fs_info->pid_ns);
> kfree(ctx);
> }
>
> @@ -178,9 +179,15 @@ static int proc_init_fs_context(struct fs_context *fc)
> if (!ctx)
> return -ENOMEM;
>
> - ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
> + if (!ctx->fs_info) {
> + kfree(ctx);
> + return -ENOMEM;
> + }
> +
> + ctx->fs_info->pid_ns = get_pid_ns(task_active_pid_ns(current));
> put_user_ns(fc->user_ns);
> - fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> + fc->user_ns = get_user_ns(ctx->fs_info->pid_ns->user_ns);
> fc->fs_private = ctx;
> fc->ops = &proc_fs_context_ops;
> return 0;
> @@ -188,15 +195,18 @@ static int proc_init_fs_context(struct fs_context *fc)
>
> static void proc_kill_sb(struct super_block *sb)
> {
> - struct pid_namespace *ns;
> + struct proc_fs_info *fs_info = proc_sb_info(sb);
> + struct pid_namespace *ns = fs_info->pid_ns;
> +
> + if (fs_info->proc_self)
> + dput(fs_info->proc_self);
> +
> + if (fs_info->proc_thread_self)
> + dput(fs_info->proc_thread_self);
>
> - ns = (struct pid_namespace *)sb->s_fs_info;
> - if (ns->proc_self)
> - dput(ns->proc_self);
> - if (ns->proc_thread_self)
> - dput(ns->proc_thread_self);
> kill_anon_super(sb);
> put_pid_ns(ns);
> + kfree(fs_info);
> }
>
> static struct file_system_type proc_fs_type = {
> diff --git a/fs/proc/self.c b/fs/proc/self.c
> index 57c0a1047250..309301ac0136 100644
> --- a/fs/proc/self.c
> +++ b/fs/proc/self.c
> @@ -36,10 +36,10 @@ static unsigned self_inum __ro_after_init;
> int proc_setup_self(struct super_block *s)
> {
> struct inode *root_inode = d_inode(s->s_root);
> - struct pid_namespace *ns = proc_pid_ns(root_inode);
> + struct proc_fs_info *fs_info = proc_sb_info(s);
> struct dentry *self;
> int ret = -ENOMEM;
> -
> +
> inode_lock(root_inode);
> self = d_alloc_name(s->s_root, "self");
> if (self) {
> @@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
> if (ret)
> pr_err("proc_fill_super: can't allocate /proc/self\n");
> else
> - ns->proc_self = self;
> + fs_info->proc_self = self;
>
> return ret;
> }
> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index f61ae53533f5..2493cbbdfa6f 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
> int proc_setup_thread_self(struct super_block *s)
> {
> struct inode *root_inode = d_inode(s->s_root);
> - struct pid_namespace *ns = proc_pid_ns(root_inode);
> + struct proc_fs_info *fs_info = proc_sb_info(s);
> struct dentry *thread_self;
> int ret = -ENOMEM;
>
> @@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
> inode_unlock(root_inode);
>
> if (ret)
> - pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
> + pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
> else
> - ns->proc_thread_self = thread_self;
> + fs_info->proc_thread_self = thread_self;
>
> return ret;
> }
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 4956e362e55e..de4534d93cb6 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,10 +32,6 @@ struct pid_namespace {
> struct kmem_cache *pid_cachep;
> unsigned int level;
> struct pid_namespace *parent;
> -#ifdef CONFIG_PROC_FS
> - struct dentry *proc_self;
> - struct dentry *proc_thread_self;
> -#endif
> #ifdef CONFIG_BSD_PROCESS_ACCT
> struct fs_pin *bacct;
> #endif
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 40a7982b7285..5920a4ecd71b 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -27,6 +27,17 @@ struct proc_ops {
> unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
> };
>
> +struct proc_fs_info {
> + struct pid_namespace *pid_ns;
> + struct dentry *proc_self; /* For /proc/self */
> + struct dentry *proc_thread_self; /* For /proc/thread-self */
> +};
> +
> +static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
> +{
> + return sb->s_fs_info;
> +}
> +
> #ifdef CONFIG_PROC_FS
>
> typedef int (*proc_write_t)(struct file *, char *, size_t);
> @@ -161,6 +172,7 @@ int open_related_ns(struct ns_common *ns,
> /* get the associated pid namespace for a file in procfs */
> static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
> {
> + return proc_sb_info(inode->i_sb)->pid_ns;
> return inode->i_sb->s_fs_info;
> }