Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed

From: Alexander Mihalicyn
Date: Sat Jul 10 2021 - 07:01:40 EST

On Sat, Jul 10, 2021 at 4:12 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, 6 Jul 2021 16:22:57 +0300 Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> > Hello,
> >
> > Task IPC namespace shm's has shm_rmid_forced feature which is per IPC namespace
> > and controlled by kernel.shm_rmid_forced sysctl. When feature is turned on,
> > then during task exit (and unshare(CLONE_NEWIPC)) all sysvshm's will be destroyed
> > by exit_shm(struct task_struct *task) function. But there is a problem if task
> > was changed IPC namespace since shmget() call. In such situation exit_shm() function
> > will try to call
> > shm_destroy(<new_ipc_namespace_ptr>, <sysvshmem_from_old_ipc_namespace>)
> > which leads to the situation when sysvshm object still attached to old
> > IPC namespace but freed; later during old IPC namespace cleanup we will try to
> > free such sysvshm object for the second time and will get the problem :)
> >
> > First patch solves this problem by postponing shm_destroy to the moment when
> > IPC namespace cleanup will be called.
> > Second patch is useful to prevent (or easy catch) such bugs in the future by
> > adding corresponding WARNings.
> >


thanks for your attention to the patches!

> (cc's added)
> I assume that a
> Fixes: b34a6b1da371ed8af ("ipc: introduce shm_rmid_forced sysctl") is
> appropriate here?

Really not, this patch looks fully safe because it always takes
objects to free from
concrete IPC namespace idr with appropriate locking. For example
/* Destroy all already created segments, but not mapped yet */
idr_for_each(&shm_ids(ns).ipcs_idr, &shm_try_destroy_current, ns);
here we take ns from current->nsproxy, lock idr, and destroy only shms
from this particular IPC ns.

But after b34a6b1da ("shm: make exit_shm work proportional to task
activity") we introduced
new field (struct sysv_shm sysvshm) on task_struct. exit_shm()
function was changed so:

list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist)
shm_mark_orphan(shp, ns);

instead of previous idr_for_each(&shm_ids(ns).ipcs_idr,
&shm_try_destroy_current, ns);

Now, using setns() syscall, we can construct situation when on
task->sysvshm.shm_clist list
we have shm items from several (!) IPC namespaces. Before this patch
we always destroying
shmems only from current->nsproxy->ipc_ns, but now we can do something
like this:

shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- add item to task->sysvshm.shm_clist
setns(fd, CLONE_NEWIPC);
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- add item to task->sysvshm.shm_clist
*now we have two items in task->sysvshm.shm_clist but from different
IPC namespaces*

(I've reproducer program and can send it privately)

That's a problem. If we take a look on
int ksys_unshare(unsigned long unshare_flags)

we can see following part:

if (unshare_flags & CLONE_NEWIPC) {
/* Orphan segments in old ns (see sem above). */

here we cleaning up this list BEFORE changing IPC namespace. So, if we
do something like:
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- add item to task->sysvshm.shm_clist
unshare(CLONE_NEWIPC); <-- task->sysvshm.shm_clist is cleaned and
reinitialized here
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- add item to task->sysvshm.shm_clist

and all fine!

So, semantics of setns() and unshare() is different here. We can fix
this problem by adding
analogical calls to exit_shm(), shm_init_task() into

static void commit_nsset(struct nsset *nsset)
if (flags & CLONE_NEWIPC) {
+ shm_init_task(current);
+ exit_shm(current);

with this change semantics of unshare() and setns() will be equal in
terms of the shm_rmid_forced
feature. But this may break some applications which using setns() and
IPC resources not destroying
after that syscall. (CRIU using setns() extensively and we have to
change our IPC ns C/R implementation
a little bit if we take this way of fixing the problem).

I've proposed a change which keeps the old behaviour of setns() but
fixes double free.

> A double-free is serious. Should this fix be backported into earlier
> kernels?

Yes, the IPC namespace code was not changed seriously, so it means
that we can easily apply this patch to older kernels.
(I've CCed stable lists in the patch where the fix was)

CCed: Andrei Vagin and Christian Brauner