Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
From: Eric W. Biederman
Date: Sat Oct 30 2021 - 00:27:02 EST
Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx> writes:
> Currently, exit_shm function not designed to work properly when
> task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
>
> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
> it leads to use-after-free (reproducer exists).
>
> That particular patch is attempt to fix the problem by extending exit_shm
> mechanism to handle shm's destroy from several IPC ns'es.
>
> To achieve that we do several things:
> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel
> 2. during new shm object creation (newseg()/shmget syscall) we initialize
> this pointer by current task IPC ns
> 3. exit_shm() fully reworked such that it traverses over all
> shp's in task->sysvshm.shm_clist and gets IPC namespace not
> from current task as it was before but from shp's object itself, then
> call shm_destroy(shp, ns).
>
> Note. We need to be really careful here, because as it was said before
> (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
> special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
> only if IPC ns not in the "state of destruction".
> Q/A
>
> Q: Why we can access shp->ns memory using non-refcounted pointer?
> A: Because shp object lifetime is always shorther
> than IPC namespace lifetime, so, if we get shp object from the
> task->sysvshm.shm_clist while holding task_lock(task) nobody can
> steal our namespace.
Not true. A struct shmid_kernel can outlive the namespace in which it
was created. I you look at do_shm_rmid which is called when the
namespace is destroyed for every shmid_kernel in the namespace that if
the struct shmid_kernel still has users only ipc_set_key_private is
called. The struct shmid_kernel continues to exist.
> Q: Does this patch change semantics of unshare/setns/clone syscalls?
> A: Not. It's just fixes non-covered case when process may leave
> IPC namespace without getting task->sysvshm.shm_clist list cleaned up.
Just reading through exit_shm the code is not currently safe.
At a minimum do_shm_rmid needs to set the shp->ns to NULL. Otherwise
the struct shmid_kernel can contain a namespace pointer after
the namespace exits. Which results in a different use after free.
Beyond that there is dropping the task lock. The code holds a reference
to the namespace which means that the code does not need to worry about
free_ipcs. References from mappings are still possible.
Which means that the code could see:
exit_shm()
task_lock()
shp = ...;
task_unlock()
shm_close()
down_write(&shm_ids(ns).rwsem);
...
shm_destroy(shp);
up_write(&shm_ids(ns).rwsem);
down_write(&shm_ids(ns)->rwsem);
shm_lock_by_ptr(shp); /* use after free */
I am trying to imagine how to close that race with the current code
structure. Maybe something could be done by looking at shm_nattach
count and making it safe to look at that count under the task_lock.
But even then because shmid_kernel is still in the hash table it could
be mapped and unmapped in the window when task_lock was dropped.
Alternatively shmctl(id, IPC_RMID) can be called in when task_lock is
dropped. Much less code is involved than mapping and unmapping so it is
much more likely to win the race.
I don't see how that race can be closed.
Am I missing something?
Eric
> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
>
> Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx>
> Cc: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> Cc: Andrei Vagin <avagin@xxxxxxxxx>
> Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
> Cc: Vasily Averin <vvs@xxxxxxxxxxxxx>
> Cc: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
> Cc: Alexander Mikhalitsyn <alexander@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Co-developed-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
> Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>
> ---
> include/linux/ipc_namespace.h | 15 +++
> include/linux/sched/task.h | 2 +-
> include/linux/shm.h | 2 +-
> ipc/shm.c | 170 +++++++++++++++++++++++++---------
> 4 files changed, 142 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index 05e22770af51..b75395ec8d52 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
> return ns;
> }
>
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> + if (ns) {
> + if (refcount_inc_not_zero(&ns->ns.count))
> + return ns;
> + }
> +
> + return NULL;
> +}
> +
> extern void put_ipc_ns(struct ipc_namespace *ns);
> #else
> static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
> return ns;
> }
>
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> + return ns;
> +}
> +
> static inline void put_ipc_ns(struct ipc_namespace *ns)
> {
> }
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..bfdf84dab4be 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
> * subscriptions and synchronises with wait4(). Also used in procfs. Also
> * pins the final release of task.io_context. Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
> *
> * Nests both inside and outside of read_lock(&tasklist_lock).
> * It must not be nested with write_lock_irq(&tasklist_lock),
> diff --git a/include/linux/shm.h b/include/linux/shm.h
> index d8e69aed3d32..709f6d0451c0 100644
> --- a/include/linux/shm.h
> +++ b/include/linux/shm.h
> @@ -11,7 +11,7 @@ struct file;
>
> #ifdef CONFIG_SYSVIPC
> struct sysv_shm {
> - struct list_head shm_clist;
> + struct list_head shm_clist;
> };
>
> long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 748933e376ca..29667e17b12a 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
> struct pid *shm_lprid;
> struct ucounts *mlock_ucounts;
>
> - /* The task created the shm object. NULL if the task is dead. */
> + /*
> + * The task created the shm object, for looking up
> + * task->sysvshm.shm_clist_lock
> + */
> struct task_struct *shm_creator;
> - struct list_head shm_clist; /* list by creator */
> +
> + /*
> + * list by creator. shm_clist_lock required for read/write
> + * if list_empty(), then the creator is dead already
> + */
> + struct list_head shm_clist;
> + struct ipc_namespace *ns;
> } __randomize_layout;
>
> /* shm_mode upper byte flags */
> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
> struct shmid_kernel *shp;
>
> shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> + WARN_ON(ns != shp->ns);
>
> if (shp->shm_nattch) {
> shp->shm_perm.mode |= SHM_DEST;
> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
> kfree(shp);
> }
>
> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
> +/*
> + * It has to be called with shp locked.
> + * It must be called before ipc_rmid()
> + */
> +static inline void shm_clist_rm(struct shmid_kernel *shp)
> {
> - list_del(&s->shm_clist);
> - ipc_rmid(&shm_ids(ns), &s->shm_perm);
> + struct task_struct *creator;
> +
> + /*
> + * A concurrent exit_shm may do a list_del_init() as well.
> + * Just do nothing if exit_shm already did the work
> + */
> + if (list_empty(&shp->shm_clist))
> + return;
> +
> + /*
> + * shp->shm_creator is guaranteed to be valid *only*
> + * if shp->shm_clist is not empty.
> + */
> + creator = shp->shm_creator;
> +
> + task_lock(creator);
> + list_del_init(&shp->shm_clist);
> + task_unlock(creator);
Lock ordering
rwsem
ipc_lock
task_lock
> +}
> +
> +static inline void shm_rmid(struct shmid_kernel *s)
> +{
> + shm_clist_rm(s);
> + ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
> }
>
>
> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> shm_file = shp->shm_file;
> shp->shm_file = NULL;
> ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - shm_rmid(ns, shp);
> + shm_rmid(shp);
> shm_unlock(shp);
> if (!is_file_hugepages(shm_file))
> shmem_lock(shm_file, 0, shp->mlock_ucounts);
> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> *
> * 2) sysctl kernel.shm_rmid_forced is set to 1.
> */
> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> +static bool shm_may_destroy(struct shmid_kernel *shp)
> {
> return (shp->shm_nattch == 0) &&
> - (ns->shm_rmid_forced ||
> + (shp->ns->shm_rmid_forced ||
> (shp->shm_perm.mode & SHM_DEST));
> }
>
> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
> ipc_update_pid(&shp->shm_lprid, task_tgid(current));
> shp->shm_dtim = ktime_get_real_seconds();
> shp->shm_nattch--;
> - if (shm_may_destroy(ns, shp))
> + if (shm_may_destroy(shp))
> shm_destroy(ns, shp);
> else
> shm_unlock(shp);
> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
> *
> * As shp->* are changed under rwsem, it's safe to skip shp locking.
> */
> - if (shp->shm_creator != NULL)
> + if (!list_empty(&shp->shm_clist))
> return 0;
>
> - if (shm_may_destroy(ns, shp)) {
> + if (shm_may_destroy(shp)) {
> shm_lock_by_ptr(shp);
> shm_destroy(ns, shp);
> }
> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
> /* Locking assumes this will only be called with task == current */
> void exit_shm(struct task_struct *task)
> {
> - struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> - struct shmid_kernel *shp, *n;
> + for (;;) {
> + struct shmid_kernel *shp;
> + struct ipc_namespace *ns;
>
> - if (list_empty(&task->sysvshm.shm_clist))
> - return;
> + task_lock(task);
> +
> + if (list_empty(&task->sysvshm.shm_clist)) {
> + task_unlock(task);
> + break;
> + }
> +
> + shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
> + shm_clist);
> +
> + /* 1) unlink */
> + list_del_init(&shp->shm_clist);
>
> - /*
> - * If kernel.shm_rmid_forced is not set then only keep track of
> - * which shmids are orphaned, so that a later set of the sysctl
> - * can clean them up.
> - */
> - if (!ns->shm_rmid_forced) {
> - down_read(&shm_ids(ns).rwsem);
> - list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> - shp->shm_creator = NULL;
> /*
> - * Only under read lock but we are only called on current
> - * so no entry on the list will be shared.
> + * 2) Get pointer to the ipc namespace. It is worth to say
> + * that this pointer is guaranteed to be valid because
> + * shp lifetime is always shorter than namespace lifetime
> + * in which shp lives.
> + * We taken task_lock it means that shp won't be freed.
> */
> - list_del(&task->sysvshm.shm_clist);
> - up_read(&shm_ids(ns).rwsem);
> - return;
> - }
> + ns = shp->ns;
>
> - /*
> - * Destroy all already created segments, that were not yet mapped,
> - * and mark any mapped as orphan to cover the sysctl toggling.
> - * Destroy is skipped if shm_may_destroy() returns false.
> - */
> - down_write(&shm_ids(ns).rwsem);
> - list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> - shp->shm_creator = NULL;
> + /*
> + * 3) If kernel.shm_rmid_forced is not set then only keep track of
> + * which shmids are orphaned, so that a later set of the sysctl
> + * can clean them up.
> + */
> + if (!ns->shm_rmid_forced) {
> + task_unlock(task);
> + continue;
> + }
>
> - if (shm_may_destroy(ns, shp)) {
> + /*
> + * 4) get a reference to the namespace.
> + * The refcount could be already 0. If it is 0, then
> + * the shm objects will be free by free_ipc_work().
> + */
> + ns = get_ipc_ns_not_zero(ns);
> + if (ns) {
> + /*
> + * 5) get a reference to the shp itself.
> + * This cannot fail: shm_clist_rm() is called before
> + * ipc_rmid(), thus the refcount cannot be 0.
> + */
> + WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
> + }
> +
> + task_unlock(task);
<<<<<<<<< BOOM >>>>>>>
I don't see anything that prevents another task from
calling shm_destroy(ns, shp) here and freeing it before
this task can take the rwsem for writing.
> +
> + if (ns) {
> + down_write(&shm_ids(ns).rwsem);
> shm_lock_by_ptr(shp);
> - shm_destroy(ns, shp);
> + /*
> + * rcu_read_lock was implicitly taken in
> + * shm_lock_by_ptr, it's safe to call
> + * ipc_rcu_putref here
> + */
> + ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
> +
> + if (ipc_valid_object(&shp->shm_perm)) {
> + if (shm_may_destroy(shp))
> + shm_destroy(ns, shp);
> + else
> + shm_unlock(shp);
> + } else {
> + /*
> + * Someone else deleted the shp from namespace
> + * idr/kht while we have waited.
> + * Just unlock and continue.
> + */
> + shm_unlock(shp);
> + }
> +
> + up_write(&shm_ids(ns).rwsem);
> + put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
> }
> }
> -
> - /* Remove the list head from any segments still attached. */
> - list_del(&task->sysvshm.shm_clist);
> - up_write(&shm_ids(ns).rwsem);
> }
>
> static vm_fault_t shm_fault(struct vm_fault *vmf)
> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> if (error < 0)
> goto no_id;
>
> + shp->ns = ns;
> +
> + task_lock(current);
> list_add(&shp->shm_clist, ¤t->sysvshm.shm_clist);
> + task_unlock(current);
>
> /*
> * shmid gets reported as "inode#" in /proc/pid/maps.
> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
> down_write(&shm_ids(ns).rwsem);
> shp = shm_lock(ns, shmid);
> shp->shm_nattch--;
> - if (shm_may_destroy(ns, shp))
> +
> + if (shm_may_destroy(shp))
> shm_destroy(ns, shp);
> else
> shm_unlock(shp);