Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses

From: Manfred Spraul
Date: Sat Oct 30 2021 - 09:11:57 EST


On 10/30/21 06:26, Eric W. Biederman wrote:
Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx> writes:

Currently, exit_shm function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to use-after-free (reproducer exists).

That particular patch is attempt to fix the problem by extending exit_shm
mechanism to handle shm's destroy from several IPC ns'es.

To achieve that we do several things:
1. add namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we initialize
this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all
shp's in task->sysvshm.shm_clist and gets IPC namespace not
from current task as it was before but from shp's object itself, then
call shm_destroy(shp, ns).

Note. We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
only if IPC ns not in the "state of destruction".



Q/A

Q: Why we can access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther
than IPC namespace lifetime, so, if we get shp object from the
task->sysvshm.shm_clist while holding task_lock(task) nobody can
steal our namespace.
Not true. A struct shmid_kernel can outlive the namespace in which it
was created. I you look at do_shm_rmid which is called when the
namespace is destroyed for every shmid_kernel in the namespace that if
the struct shmid_kernel still has users only ipc_set_key_private is
called. The struct shmid_kernel continues to exist.

No, shm_nattach is always 0 when a namespace is destroyed.

Thus it is impossible that shmid_kernel continues to exist.

Let's check all shm_nattach modifications:

1) do_shmat:

    shp->shm_nattach++;

    sfd->ns = get_ipc_ns(ns);

    shp->shm_nattach--;

pairs with

   shm_release()

        put_ipc_ns()

2) shm_open()

only shp->shm_nattach++

shm_open unconditionally accesses shm_file_data, i.e. sfd must be valid, there must be a reference to the namespace

pairs with shm_close()

only shp->shm_nattach--;

shm_close unconditionally accesses shm_file_data, i.e. sfd must be valid, there must be a reference to the namespace

As shm_open()/close "nests" inside do_shmat: there is always a get_ipc_ns().

Or, much simpler: Check shm_open() and shm_close():

These two functions address a shm segment by namespace and  ID, not by a shm pointer. Thus _if_ it is possible that shm_nattach is > 0 at namespace destruction, then there would be far more issues.


Or: Attached is a log file, a test application, and a patch that adds pr_info statements.

The namespace is destroyed immediately when no segments are mapped, the destruction is delayed until exit() if there are mapped segments.


Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: Not. It's just fixes non-covered case when process may leave
IPC namespace without getting task->sysvshm.shm_clist list cleaned up.

Just reading through exit_shm the code is not currently safe.

At a minimum do_shm_rmid needs to set the shp->ns to NULL. Otherwise
the struct shmid_kernel can contain a namespace pointer after
the namespace exits. Which results in a different use after free.
No [unless there are additional bugs]

Beyond that there is dropping the task lock. The code holds a reference
to the namespace which means that the code does not need to worry about
free_ipcs. References from mappings are still possible.

Which means that the code could see:
exit_shm()
task_lock()
shp = ...;

task_unlock()
shm_close()
down_write(&shm_ids(ns).rwsem);
...
shm_destroy(shp);
up_write(&shm_ids(ns).rwsem);
down_write(&shm_ids(ns)->rwsem);
shm_lock_by_ptr(shp); /* use after free */


I am trying to imagine how to close that race with the current code
structure. Maybe something could be done by looking at shm_nattach
count and making it safe to look at that count under the task_lock.

There is no race. Before dropping task_lock, a reference to both the namespace and the shp pointer is obtained.

Thus neither one can disappear.

But even then because shmid_kernel is still in the hash table it could
be mapped and unmapped in the window when task_lock was dropped.

We have ipc_valid_object(), i.e. perm->deleted. If set, then the pointer and the spinlock are valid, even though the rest is already destroyed.

ipc_rmid() just sets deleted, the (rcu delayed) kfree is done via ipc_rcu_putref().
Alternatively shmctl(id, IPC_RMID) can be called in when task_lock is
dropped. Much less code is involved than mapping and unmapping so it is
much more likely to win the race.

I don't see how that race can be closed.

Am I missing something?

Eric


Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx>
Cc: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Andrei Vagin <avagin@xxxxxxxxx>
Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
Cc: Vasily Averin <vvs@xxxxxxxxxxxxx>
Cc: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Cc: Alexander Mikhalitsyn <alexander@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Co-developed-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>

Should/can I mark that I have tested the code?

I would drop one change and one comment is incorrect, otherwise no findings. See the attached 0002 patch

Tested-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>

---
include/linux/ipc_namespace.h | 15 +++
include/linux/sched/task.h | 2 +-
include/linux/shm.h | 2 +-
ipc/shm.c | 170 +++++++++++++++++++++++++---------
4 files changed, 142 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
return ns;
}
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+ if (ns) {
+ if (refcount_inc_not_zero(&ns->ns.count))
+ return ns;
+ }
+
+ return NULL;
+}
+
extern void put_ipc_ns(struct ipc_namespace *ns);
#else
static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
return ns;
}
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+ return ns;
+}
+
static inline void put_ipc_ns(struct ipc_namespace *ns)
{
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..bfdf84dab4be 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
* Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
* subscriptions and synchronises with wait4(). Also used in procfs. Also
* pins the final release of task.io_context. Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
*
* Nests both inside and outside of read_lock(&tasklist_lock).
* It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/include/linux/shm.h b/include/linux/shm.h
index d8e69aed3d32..709f6d0451c0 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -11,7 +11,7 @@ struct file;
#ifdef CONFIG_SYSVIPC
struct sysv_shm {
- struct list_head shm_clist;
+ struct list_head shm_clist;
};
This is a whitespace only change. We can drop it.
long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..29667e17b12a 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
struct pid *shm_lprid;
struct ucounts *mlock_ucounts;
- /* The task created the shm object. NULL if the task is dead. */
+ /*
+ * The task created the shm object, for looking up
+ * task->sysvshm.shm_clist_lock
+ */
struct task_struct *shm_creator;
- struct list_head shm_clist; /* list by creator */
+
+ /*
+ * list by creator. shm_clist_lock required for read/write
+ * if list_empty(), then the creator is dead already
+ */
shm_clist_lock was replaced by task_lock(->shm_creator).
+ struct list_head shm_clist;
+ struct ipc_namespace *ns;
} __randomize_layout;
/* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
struct shmid_kernel *shp;
shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+ WARN_ON(ns != shp->ns);
if (shp->shm_nattch) {
shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
kfree(shp);
}
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
{
- list_del(&s->shm_clist);
- ipc_rmid(&shm_ids(ns), &s->shm_perm);
+ struct task_struct *creator;
+
+ /*
+ * A concurrent exit_shm may do a list_del_init() as well.
+ * Just do nothing if exit_shm already did the work
+ */
+ if (list_empty(&shp->shm_clist))
+ return;
+
+ /*
+ * shp->shm_creator is guaranteed to be valid *only*
+ * if shp->shm_clist is not empty.
+ */
+ creator = shp->shm_creator;
+
+ task_lock(creator);
+ list_del_init(&shp->shm_clist);
+ task_unlock(creator);
Lock ordering
rwsem
ipc_lock
task_lock
correct.
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
+{
+ shm_clist_rm(s);
+ ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
}
@@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_file = shp->shm_file;
shp->shm_file = NULL;
ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
- shm_rmid(ns, shp);
+ shm_rmid(shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
*
* 2) sysctl kernel.shm_rmid_forced is set to 1.
*/
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
{
return (shp->shm_nattch == 0) &&
- (ns->shm_rmid_forced ||
+ (shp->ns->shm_rmid_forced ||
(shp->shm_perm.mode & SHM_DEST));
}
@@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
ipc_update_pid(&shp->shm_lprid, task_tgid(current));
shp->shm_dtim = ktime_get_real_seconds();
shp->shm_nattch--;
- if (shm_may_destroy(ns, shp))
+ if (shm_may_destroy(shp))
shm_destroy(ns, shp);
else
shm_unlock(shp);
@@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
*
* As shp->* are changed under rwsem, it's safe to skip shp locking.
*/
- if (shp->shm_creator != NULL)
+ if (!list_empty(&shp->shm_clist))
return 0;
- if (shm_may_destroy(ns, shp)) {
+ if (shm_may_destroy(shp)) {
shm_lock_by_ptr(shp);
shm_destroy(ns, shp);
}
@@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
/* Locking assumes this will only be called with task == current */
void exit_shm(struct task_struct *task)
{
- struct ipc_namespace *ns = task->nsproxy->ipc_ns;
- struct shmid_kernel *shp, *n;
+ for (;;) {
+ struct shmid_kernel *shp;
+ struct ipc_namespace *ns;
- if (list_empty(&task->sysvshm.shm_clist))
- return;
+ task_lock(task);
+
+ if (list_empty(&task->sysvshm.shm_clist)) {
+ task_unlock(task);
+ break;
+ }
+
+ shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+ shm_clist);
+
+ /* 1) unlink */
+ list_del_init(&shp->shm_clist);
- /*
- * If kernel.shm_rmid_forced is not set then only keep track of
- * which shmids are orphaned, so that a later set of the sysctl
- * can clean them up.
- */
- if (!ns->shm_rmid_forced) {
- down_read(&shm_ids(ns).rwsem);
- list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
- shp->shm_creator = NULL;
/*
- * Only under read lock but we are only called on current
- * so no entry on the list will be shared.
+ * 2) Get pointer to the ipc namespace. It is worth to say
+ * that this pointer is guaranteed to be valid because
+ * shp lifetime is always shorter than namespace lifetime
+ * in which shp lives.
+ * We taken task_lock it means that shp won't be freed.
*/
- list_del(&task->sysvshm.shm_clist);
- up_read(&shm_ids(ns).rwsem);
- return;
- }
+ ns = shp->ns;
- /*
- * Destroy all already created segments, that were not yet mapped,
- * and mark any mapped as orphan to cover the sysctl toggling.
- * Destroy is skipped if shm_may_destroy() returns false.
- */
- down_write(&shm_ids(ns).rwsem);
- list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
- shp->shm_creator = NULL;
+ /*
+ * 3) If kernel.shm_rmid_forced is not set then only keep track of
+ * which shmids are orphaned, so that a later set of the sysctl
+ * can clean them up.
+ */
+ if (!ns->shm_rmid_forced) {
+ task_unlock(task);
+ continue;
+ }
- if (shm_may_destroy(ns, shp)) {
+ /*
+ * 4) get a reference to the namespace.
+ * The refcount could be already 0. If it is 0, then
+ * the shm objects will be free by free_ipc_work().
+ */
+ ns = get_ipc_ns_not_zero(ns);
+ if (ns) {
+ /*
+ * 5) get a reference to the shp itself.
+ * This cannot fail: shm_clist_rm() is called before
+ * ipc_rmid(), thus the refcount cannot be 0.
+ */
+ WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+ }
+
+ task_unlock(task);
<<<<<<<<< BOOM >>>>>>>

I don't see anything that prevents another task from
calling shm_destroy(ns, shp) here and freeing it before
this task can take the rwsem for writing.

shm_destroy() can be called. But due to the ipc_rcu_getref(), the structure will remain valid.


+
+ if (ns) {
+ down_write(&shm_ids(ns).rwsem);
shm_lock_by_ptr(shp);
- shm_destroy(ns, shp);
+ /*
+ * rcu_read_lock was implicitly taken in
+ * shm_lock_by_ptr, it's safe to call
+ * ipc_rcu_putref here
+ */
+ ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+ if (ipc_valid_object(&shp->shm_perm)) {

And this will return false if there was a shm_destroy().


+ if (shm_may_destroy(shp))
+ shm_destroy(ns, shp);
+ else
+ shm_unlock(shp);
+ } else {
+ /*
+ * Someone else deleted the shp from namespace
+ * idr/kht while we have waited.
+ * Just unlock and continue.
+ */

-> just do a NOP if shm_destroy() was alread performed.

Actually, the same design is used by find_alloc_undo() in ipc/sem.c.

+ shm_unlock(shp);
+ }
+
+ up_write(&shm_ids(ns).rwsem);
+ put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
}
}
-
- /* Remove the list head from any segments still attached. */
- list_del(&task->sysvshm.shm_clist);
- up_write(&shm_ids(ns).rwsem);
}
static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (error < 0)
goto no_id;
+ shp->ns = ns;
+
+ task_lock(current);
list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+ task_unlock(current);
/*
* shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
down_write(&shm_ids(ns).rwsem);
shp = shm_lock(ns, shmid);
shp->shm_nattch--;
- if (shm_may_destroy(ns, shp))
+
+ if (shm_may_destroy(shp))
shm_destroy(ns, shp);
else
shm_unlock(shp);

From c9b0b5037865aa7714b0e7c96082e0296d8a42b9 Mon Sep 17 00:00:00 2001
From: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>
Date: Thu, 28 Oct 2021 01:43:48 +0300
Subject: [PATCH 2/3] shm: extend forced shm destroy to support objects from
several IPC nses

Currently, exit_shm function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to use-after-free (reproducer exists).

That particular patch is attempt to fix the problem by extending exit_shm
mechanism to handle shm's destroy from several IPC ns'es.

To achieve that we do several things:
1. add namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we initialize
this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all
shp's in task->sysvshm.shm_clist and gets IPC namespace not
from current task as it was before but from shp's object itself, then
call shm_destroy(shp, ns).

Note. We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
only if IPC ns not in the "state of destruction".

Q/A

Q: Why we can access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther
than IPC namespace lifetime, so, if we get shp object from the
task->sysvshm.shm_clist while holding task_lock(task) nobody can
steal our namespace.

Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: Not. It's just fixes non-covered case when process may leave
IPC namespace without getting task->sysvshm.shm_clist list cleaned up.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx>
Cc: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Andrei Vagin <avagin@xxxxxxxxx>
Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
Cc: Vasily Averin <vvs@xxxxxxxxxxxxx>
Cc: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Cc: Alexander Mikhalitsyn <alexander@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Co-developed-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>
---
include/linux/ipc_namespace.h | 15 +++
include/linux/sched/task.h | 2 +-
include/linux/shm.h | 2 +-
ipc/shm.c | 170 +++++++++++++++++++++++++---------
4 files changed, 142 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
return ns;
}

+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+ if (ns) {
+ if (refcount_inc_not_zero(&ns->ns.count))
+ return ns;
+ }
+
+ return NULL;
+}
+
extern void put_ipc_ns(struct ipc_namespace *ns);
#else
static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
return ns;
}

+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+ return ns;
+}
+
static inline void put_ipc_ns(struct ipc_namespace *ns)
{
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..bfdf84dab4be 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
* Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
* subscriptions and synchronises with wait4(). Also used in procfs. Also
* pins the final release of task.io_context. Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
*
* Nests both inside and outside of read_lock(&tasklist_lock).
* It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/ipc/shm.c b/ipc/shm.c
index ab749be6d8b7..ebb25a8ecc58 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
struct pid *shm_lprid;
struct ucounts *mlock_ucounts;

- /* The task created the shm object. NULL if the task is dead. */
+ /*
+ * The task created the shm object, for looking up
+ * task->sysvshm.shm_clist_lock
+ */
struct task_struct *shm_creator;
- struct list_head shm_clist; /* list by creator */
+
+ /*
+ * List by creator. task_lock(->shm_creator) required for read/write.
+ * If list_empty(), then the creator is dead already.
+ */
+ struct list_head shm_clist;
+ struct ipc_namespace *ns;
} __randomize_layout;

/* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
struct shmid_kernel *shp;

shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+ WARN_ON(ns != shp->ns);

if (shp->shm_nattch) {
shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
kfree(shp);
}

-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
{
- list_del(&s->shm_clist);
- ipc_rmid(&shm_ids(ns), &s->shm_perm);
+ struct task_struct *creator;
+
+ /*
+ * A concurrent exit_shm may do a list_del_init() as well.
+ * Just do nothing if exit_shm already did the work
+ */
+ if (list_empty(&shp->shm_clist))
+ return;
+
+ /*
+ * shp->shm_creator is guaranteed to be valid *only*
+ * if shp->shm_clist is not empty.
+ */
+ creator = shp->shm_creator;
+
+ task_lock(creator);
+ list_del_init(&shp->shm_clist);
+ task_unlock(creator);
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
+{
+ shm_clist_rm(s);
+ ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
}


@@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_file = shp->shm_file;
shp->shm_file = NULL;
ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
- shm_rmid(ns, shp);
+ shm_rmid(shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
*
* 2) sysctl kernel.shm_rmid_forced is set to 1.
*/
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
{
return (shp->shm_nattch == 0) &&
- (ns->shm_rmid_forced ||
+ (shp->ns->shm_rmid_forced ||
(shp->shm_perm.mode & SHM_DEST));
}

@@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
ipc_update_pid(&shp->shm_lprid, task_tgid(current));
shp->shm_dtim = ktime_get_real_seconds();
shp->shm_nattch--;
- if (shm_may_destroy(ns, shp))
+ if (shm_may_destroy(shp))
shm_destroy(ns, shp);
else
shm_unlock(shp);
@@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
*
* As shp->* are changed under rwsem, it's safe to skip shp locking.
*/
- if (shp->shm_creator != NULL)
+ if (!list_empty(&shp->shm_clist))
return 0;

- if (shm_may_destroy(ns, shp)) {
+ if (shm_may_destroy(shp)) {
shm_lock_by_ptr(shp);
shm_destroy(ns, shp);
}
@@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
/* Locking assumes this will only be called with task == current */
void exit_shm(struct task_struct *task)
{
- struct ipc_namespace *ns = task->nsproxy->ipc_ns;
- struct shmid_kernel *shp, *n;
+ for (;;) {
+ struct shmid_kernel *shp;
+ struct ipc_namespace *ns;

- if (list_empty(&task->sysvshm.shm_clist))
- return;
+ task_lock(task);
+
+ if (list_empty(&task->sysvshm.shm_clist)) {
+ task_unlock(task);
+ break;
+ }
+
+ shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+ shm_clist);
+
+ /* 1) unlink */
+ list_del_init(&shp->shm_clist);

- /*
- * If kernel.shm_rmid_forced is not set then only keep track of
- * which shmids are orphaned, so that a later set of the sysctl
- * can clean them up.
- */
- if (!ns->shm_rmid_forced) {
- down_read(&shm_ids(ns).rwsem);
- list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
- shp->shm_creator = NULL;
/*
- * Only under read lock but we are only called on current
- * so no entry on the list will be shared.
+ * 2) Get pointer to the ipc namespace. It is worth to say
+ * that this pointer is guaranteed to be valid because
+ * shp lifetime is always shorter than namespace lifetime
+ * in which shp lives.
+ * We taken task_lock it means that shp won't be freed.
*/
- list_del(&task->sysvshm.shm_clist);
- up_read(&shm_ids(ns).rwsem);
- return;
- }
+ ns = shp->ns;

- /*
- * Destroy all already created segments, that were not yet mapped,
- * and mark any mapped as orphan to cover the sysctl toggling.
- * Destroy is skipped if shm_may_destroy() returns false.
- */
- down_write(&shm_ids(ns).rwsem);
- list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
- shp->shm_creator = NULL;
+ /*
+ * 3) If kernel.shm_rmid_forced is not set then only keep track of
+ * which shmids are orphaned, so that a later set of the sysctl
+ * can clean them up.
+ */
+ if (!ns->shm_rmid_forced) {
+ task_unlock(task);
+ continue;
+ }

- if (shm_may_destroy(ns, shp)) {
+ /*
+ * 4) get a reference to the namespace.
+ * The refcount could be already 0. If it is 0, then
+ * the shm objects will be free by free_ipc_work().
+ */
+ ns = get_ipc_ns_not_zero(ns);
+ if (ns) {
+ /*
+ * 5) get a reference to the shp itself.
+ * This cannot fail: shm_clist_rm() is called before
+ * ipc_rmid(), thus the refcount cannot be 0.
+ */
+ WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+ }
+
+ task_unlock(task);
+
+ if (ns) {
+ down_write(&shm_ids(ns).rwsem);
shm_lock_by_ptr(shp);
- shm_destroy(ns, shp);
+ /*
+ * rcu_read_lock was implicitly taken in
+ * shm_lock_by_ptr, it's safe to call
+ * ipc_rcu_putref here
+ */
+ ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+ if (ipc_valid_object(&shp->shm_perm)) {
+ if (shm_may_destroy(shp))
+ shm_destroy(ns, shp);
+ else
+ shm_unlock(shp);
+ } else {
+ /*
+ * Someone else deleted the shp from namespace
+ * idr/kht while we have waited.
+ * Just unlock and continue.
+ */
+ shm_unlock(shp);
+ }
+
+ up_write(&shm_ids(ns).rwsem);
+ put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
}
}
-
- /* Remove the list head from any segments still attached. */
- list_del(&task->sysvshm.shm_clist);
- up_write(&shm_ids(ns).rwsem);
}

static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (error < 0)
goto no_id;

+ shp->ns = ns;
+
+ task_lock(current);
list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+ task_unlock(current);

/*
* shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
down_write(&shm_ids(ns).rwsem);
shp = shm_lock(ns, shmid);
shp->shm_nattch--;
- if (shm_may_destroy(ns, shp))
+
+ if (shm_may_destroy(shp))
shm_destroy(ns, shp);
else
shm_unlock(shp);
--
2.31.1

From ed67173357031d9a501e41b6be05cfc438f44adc Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Date: Sat, 30 Oct 2021 14:27:25 +0200
Subject: [PATCH 3/3] [DEBUG CODE] instrummented ipc/shm.c

Target: show that namespaces cannot outlive a shm segment.

Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
---
ipc/shm.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/ipc/shm.c b/ipc/shm.c
index ebb25a8ecc58..6222d5b8acf6 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -126,6 +126,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
shp = container_of(ipcp, struct shmid_kernel, shm_perm);
WARN_ON(ns != shp->ns);

+pr_info("do_shm_rmid(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
if (shp->shm_nattch) {
shp->shm_perm.mode |= SHM_DEST;
/* Do not find it any more */
@@ -138,9 +139,11 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
#ifdef CONFIG_IPC_NS
void shm_exit_ns(struct ipc_namespace *ns)
{
+pr_info("namespace %px: in exit_ns.\n", ns);
free_ipcs(ns, &shm_ids(ns), do_shm_rmid);
idr_destroy(&ns->ids[IPC_SHM_IDS].ipcs_idr);
rhashtable_destroy(&ns->ids[IPC_SHM_IDS].key_ht);
+pr_info("namespace %px: end of exit_ns.\n", ns);
}
#endif

@@ -287,6 +290,7 @@ static int __shm_open(struct vm_area_struct *vma)

shp->shm_atim = ktime_get_real_seconds();
ipc_update_pid(&shp->shm_lprid, task_tgid(current));
+pr_info("__shm_open(): before ++: shp %px, sfd->file %px: shp->shm_nattch %ld.\n", shp, sfd->file, shp->shm_nattch);
shp->shm_nattch++;
shm_unlock(shp);
return 0;
@@ -344,6 +348,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
*/
static bool shm_may_destroy(struct shmid_kernel *shp)
{
+pr_info("shm_may_destroy(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
return (shp->shm_nattch == 0) &&
(shp->ns->shm_rmid_forced ||
(shp->shm_perm.mode & SHM_DEST));
@@ -375,6 +380,7 @@ static void shm_close(struct vm_area_struct *vma)

ipc_update_pid(&shp->shm_lprid, task_tgid(current));
shp->shm_dtim = ktime_get_real_seconds();
+pr_info("shm_close(): before --: shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
shp->shm_nattch--;
if (shm_may_destroy(shp))
shm_destroy(ns, shp);
@@ -590,6 +596,7 @@ static int shm_release(struct inode *ino, struct file *file)
{
struct shm_file_data *sfd = shm_file_data(file);

+pr_info("shm_release: file %px, put_ipc_ns().\n", sfd->file);
put_ipc_ns(sfd->ns);
fput(sfd->file);
shm_file_data(file) = NULL;
@@ -748,6 +755,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
shp->shm_segsz = size;
shp->shm_nattch = 0;
shp->shm_file = file;
+pr_info("newseg(): shp %px: shp->shm_nattch %ld ->shmfile %px.\n", shp, shp->shm_nattch, shp->shm_file);
shp->shm_creator = current;

/* ipc_addid() locks shp upon success. */
@@ -1588,6 +1596,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
* detect shm ID reuse we need to compare the file pointers.
*/
base = get_file(shp->shm_file);
+pr_info("do_shmat(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
shp->shm_nattch++;
size = i_size_read(file_inode(base));
ipc_unlock_object(&shp->shm_perm);
@@ -1612,6 +1621,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
}

sfd->id = shp->shm_perm.id;
+pr_info("do_shmat(): shp %px: get_ipc_ns().\n", shp);
sfd->ns = get_ipc_ns(ns);
sfd->file = base;
sfd->vm_ops = NULL;
@@ -1651,6 +1661,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
out_nattch:
down_write(&shm_ids(ns).rwsem);
shp = shm_lock(ns, shmid);
+pr_info("do_shmat() before --: shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
shp->shm_nattch--;

if (shm_may_destroy(shp))
--
2.31.1

#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <fcntl.h>

#define _GNU_SOURCE
#define __USE_GNU
#include <sched.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <pthread.h>
#include <unistd.h>

static void do_unshare(int num, int flags)
{
int res;

printf(" %d) unshare(0x%x).\n", num, flags);
res = unshare(flags);
if (res != 0) {
printf(" %d: unshare(0x%x) failed, errno %d.\n", num, flags, errno);
exit(3);
}
}

static void set_rmid_forced(char *value)
{
int fd;
int i;

fd=open("/proc/sys/kernel/shm_rmid_forced", O_RDWR);
if (fd == -1) {
printf("open shm_rmid_forced failed, errno %d.\n", errno);
exit (1);
}
i = write(fd, value, 2);
if (i != 2) {
printf("unexpected result when writing %s to shm_rmid_forced: %d, errno %d.\n", value, i, errno);
exit (2);
}
close(fd);
}

static void *do_shmget(bool map_it)
{
int seg;
void *ptr;

if ((seg = shmget (IPC_PRIVATE, 1, IPC_CREAT| 0600)) == -1) {
perror("shmget");
exit(3);
}
if (map_it) {
if ((ptr = shmat (seg, 0, 0)) == (void*)-1) {
perror ("shmat");
exit(4);
}
} else {
ptr = NULL;
}
return ptr;
}

int main (int argc, char **argv)
{
pid_t child;

(void)argv;
(void)argc;

printf("shmns4:\n");
printf(" One process creates and maps shm segments in multiple namespaces.\n");
printf(" The namespaces are replaced before unmapping the segments.\n");

do_unshare(1, CLONE_NEWIPC);
set_rmid_forced("1\n");

child = fork();
if (child == -1) {
perror ("fork");
exit(5);
}
if (child == 0) {
printf("create a namespace, create 2 shm segments, do not map them.\n");
do_unshare(2, CLONE_NEWIPC);
set_rmid_forced("1\n");
do_shmget(false);
do_shmget(false);

do_unshare(2, CLONE_NEWIPC);
set_rmid_forced("0\n");
sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

printf("create a namespace, create 2 shm segments, do not map them, no auto-rm.\n");
do_shmget(false);
do_shmget(false);

do_unshare(2, CLONE_NEWIPC);
set_rmid_forced("1\n");
sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

printf("create a namespace, create 2 shm segments, map them.\n");
do_shmget(true);
do_shmget(true);

do_unshare(2, CLONE_NEWIPC);
set_rmid_forced("0\n");
sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

printf("Once more: Create a namespace, create 2 shm segments, map them, no auto-rm.\n");
do_shmget(true);
do_shmget(true);


printf("Orphan namespace (switch back to parent namespace).\n");
{
char path[255];
int fd;

sprintf(path, "/proc/%d/ns/ipc", getppid());
fd = open(path, O_RDONLY);
if (fd == -1) {
perror("open ipc ns");
exit(6);
}
if (setns(fd, 0) == -1) {
perror("setns to parent");
exit(7);
}
}
sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

printf("Before exit of child: 4 mappings exist in 2 namespaces.\n");

exit(0);

} else {
int status;
int ret;

sleep(1);
ret = waitpid(child, &status, 0);
sleep(10);
printf("parent:waitpid returned %d, status %d.\n", ret, status);
}
return 0;
}
#./shmns4
shmns4:
One process creates and maps shm segments in multiple namespaces.
The namespaces are replaced before unmapping the segments.
1) unshare(0x8000000).
create a namespace, create 2 shm segments, do not map them.
2) unshare(0x8000000).
[ 71.444890] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f500.
[ 71.448696] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f900.
2) unshare(0x8000000).
[ 71.453352] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 0.
[ 71.455822] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 0.
[ 71.460332] namespace ffff888003679400: in exit_ns.
[ 71.461783] namespace ffff888003679400: end of exit_ns.
create a namespace, create 2 shm segments, do not map them, no auto-rm.
[ 76.481527] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f800.
[ 76.486162] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f900.
2) unshare(0x8000000).
[ 76.496480] namespace ffff888003679800: in exit_ns.
[ 76.499758] do_shm_rmid(): shp ffff888003a84f00: shp->shm_nattch 0.
[ 76.515934] do_shm_rmid(): shp ffff888003a84e00: shp->shm_nattch 0.
[ 76.537126] namespace ffff888003679800: end of exit_ns.
create a namespace, create 2 shm segments, map them.
[ 81.517464] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f800.
[ 81.526964] do_shmat(): shp ffff888003a84e00: shp->shm_nattch 0.
[ 81.531575] do_shmat(): shp ffff888003a84e00: get_ipc_ns().
[ 81.542459] __shm_open(): before ++: shp ffff888003a84e00, sfd->file ffff88800428f800: shp->shm_nattch 1.
[ 81.549390] do_shmat() before --: shp ffff888003a84e00: shp->shm_nattch 2.
[ 81.554699] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 1.
[ 81.560649] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f500.
[ 81.564649] do_shmat(): shp ffff888003a84f00: shp->shm_nattch 0.
[ 81.568681] do_shmat(): shp ffff888003a84f00: get_ipc_ns().
[ 81.573865] __shm_open(): before ++: shp ffff888003a84f00, sfd->file ffff88800428f500: shp->shm_nattch 1.
[ 81.576866] do_shmat() before --: shp ffff888003a84f00: shp->shm_nattch 2.
[ 81.580494] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 1.
2) unshare(0x8000000).
[ 81.589648] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 1.
[ 81.592431] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 1.
Once more: Create a namespace, create 2 shm segments, map them, no auto-rm.
[ 86.609807] newseg(): shp ffff888003a84000: shp->shm_nattch 0 ->shmfile ffff888004023a00.
[ 86.613978] do_shmat(): shp ffff888003a84000: shp->shm_nattch 0.
[ 86.616616] do_shmat(): shp ffff888003a84000: get_ipc_ns().
[ 86.621714] __shm_open(): before ++: shp ffff888003a84000, sfd->file ffff888004023a00: shp->shm_nattch 1.
[ 86.625975] do_shmat() before --: shp ffff888003a84000: shp->shm_nattch 2.
[ 86.629578] shm_may_destroy(): shp ffff888003a84000: shp->shm_nattch 1.
[ 86.633766] newseg(): shp ffff888003a84100: shp->shm_nattch 0 ->shmfile ffff888004023e00.
[ 86.639642] do_shmat(): shp ffff888003a84100: shp->shm_nattch 0.
[ 86.643634] do_shmat(): shp ffff888003a84100: get_ipc_ns().
[ 86.646951] __shm_open(): before ++: shp ffff888003a84100, sfd->file ffff888004023e00: shp->shm_nattch 1.
[ 86.651648] do_shmat() before --: shp ffff888003a84100: shp->shm_nattch 2.
[ 86.660527] shm_may_destroy(): shp ffff888003a84100: shp->shm_nattch 1.
Orphan namespace (switch back to parent namespace).
Before exit of child: 4 mappings exist in 2 namespaces.
[ 91.750385] shm_close(): before --: shp ffff888003a84100: shp->shm_nattch 1.
[ 91.755503] shm_may_destroy(): shp ffff888003a84100: shp->shm_nattch 0.
[ 91.758710] shm_close(): before --: shp ffff888003a84000: shp->shm_nattch 1.
[ 91.761828] shm_may_destroy(): shp ffff888003a84000: shp->shm_nattch 0.
[ 91.764879] shm_close(): before --: shp ffff888003a84f00: shp->shm_nattch 1.
[ 91.768248] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 0.
[ 91.772642] shm_close(): before --: shp ffff888003a84e00: shp->shm_nattch 1.
[ 91.776417] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 0.
[ 91.790861] shm_release: file ffff88800428f800, put_ipc_ns().
[ 91.796858] shm_release: file ffff88800428f500, put_ipc_ns().
[ 91.805866] namespace ffff888003679400: in exit_ns.
[ 91.808458] namespace ffff888003679400: end of exit_ns.
[ 91.816613] shm_release: file ffff888004023a00, put_ipc_ns().
[ 91.821392] shm_release: file ffff888004023e00, put_ipc_ns().
[ 91.825715] namespace ffff888003679800: in exit_ns.
[ 91.828811] do_shm_rmid(): shp ffff888003a84000: shp->shm_nattch 0.
[ 91.832453] do_shm_rmid(): shp ffff888003a84100: shp->shm_nattch 0.
[ 91.843841] namespace ffff888003679800: end of exit_ns.
parent:waitpid returned 285, status 0.
[ 101.882842] namespace ffff888003679000: in exit_ns.
[ 101.885707] namespace ffff888003679000: end of exit_ns.
#