Kernel crash in cgroup_pidlist_destroy_work_fn()

From: Cong Wang
Date: Tue Sep 16 2014 - 19:56:21 EST


Hi, Tejun


We saw a kernel NULL pointer dereference in
cgroup_pidlist_destroy_work_fn(), more precisely in
__mutex_lock_slowpath(), on 3.14. I can send you the full stack trace
on request.

Looking at the code, it seems flush_workqueue() doesn't wait for newly
queued work; it only waits for the work that is already pending. If this
is correct, then we could have the following race condition:

cgroup_pidlist_destroy_all():
    //...
    mutex_lock(&cgrp->pidlist_mutex);
    list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
        mod_delayed_work(cgroup_pidlist_destroy_wq,
                         &l->destroy_dwork, 0);
    mutex_unlock(&cgrp->pidlist_mutex);

    // <--- another process calls cgroup_pidlist_start() here,
    //      since the mutex is released

    // <--- another process adds a new pidlist and queues work
    //      in parallel with this flush
    flush_workqueue(cgroup_pidlist_destroy_wq);

    // <--- this check passes; list_add() could happen after it
    BUG_ON(!list_empty(&cgrp->pidlists));


Therefore, the newly added pidlist will point to a freed cgroup, and
when its delayed destroy work later runs and tries to take that
cgroup's pidlist_mutex, we will crash.
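
To make the suspected ordering concrete, here is a single-threaded toy
replay of it in C. This is only a sketch of the theory above, not kernel
code: struct toy_cgroup and all toy_* helpers are made-up names, the
workqueue is reduced to a counter, and it assumes the late reader really
can slip in after the BUG_ON() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_cgroup {
	int nr_pidlists;	/* entries on the pidlists list */
	int nr_queued;		/* destroy works queued but not yet run */
	bool freed;		/* set once the cgroup memory is released */
};

/* Models a reader: start() adds a pidlist, stop() queues its destroy work. */
static void toy_reader(struct toy_cgroup *c)
{
	c->nr_pidlists++;
	c->nr_queued++;
}

/* Models one destroy work item; returns false if it touched a freed cgroup. */
static bool toy_destroy_work(struct toy_cgroup *c)
{
	if (c->freed)
		return false;	/* the real work would lock a mutex in freed memory */
	c->nr_pidlists--;
	c->nr_queued--;
	return true;
}

/* Replays the suspected order of events; returns true if the bug is hit. */
static bool toy_replay_race(void)
{
	struct toy_cgroup c = { .nr_pidlists = 1, .nr_queued = 0, .freed = false };

	c.nr_queued = c.nr_pidlists;	/* destroy_all: queue every pidlist */
	while (c.nr_queued > 0)		/* flush: run only what is queued now */
		toy_destroy_work(&c);
	assert(c.nr_pidlists == 0);	/* the BUG_ON() check passes */
	toy_reader(&c);			/* late reader adds a new pidlist */
	c.freed = true;			/* the cgroup is released */
	return !toy_destroy_work(&c);	/* delayed work runs afterwards */
}
```

In this model toy_replay_race() reports the bad access because nothing
stops toy_reader() from running once the flush has finished.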

The attached patch (compile-tested ONLY) is a possible fix: it takes a
reference on the cgroup in cgroup_pidlist_start(), bailing out if that
fails, and drops it again in cgroup_pidlist_stop(). But I could very
easily miss something here since there are many cgroup changes after
3.14 and I don't follow cgroup development.
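
For readers unfamiliar with the "tryget" idea the patch relies on, a
minimal userspace sketch in C11 follows. It is an analogy only and does
not show how the kernel implements cgroup_tryget(); struct obj,
obj_init(), obj_tryget() and obj_put() are hypothetical names. The point
is that a reference can be taken only while the count is still above
zero, so a late caller fails cleanly instead of reviving a dying object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcounted object. */
struct obj {
	atomic_int refcnt;	/* 0 means the object is dead or being freed */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcnt, 1);	/* the creator holds the base reference */
}

/* Take a reference only if the count has not already dropped to zero. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcnt, 1);

	assert(old > 0);	/* dropping a reference nobody holds is a bug */
	return old == 1;
}
```

With this shape, a start()-like path would call obj_tryget() first and
return early on failure, and the matching stop()-like path would call
obj_put() once it is done with the object.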

What do you think?

Thanks.
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 940aced..2206151 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4084,6 +4084,9 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
int index = 0, pid = *pos;
int *iter, ret;

+ if (!cgroup_tryget(cgrp))
+ return NULL;
+
mutex_lock(&cgrp->pidlist_mutex);

/*
@@ -4132,13 +4135,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)

static void cgroup_pidlist_stop(struct seq_file *s, void *v)
{
+ struct cgroup *cgrp = seq_css(s)->cgroup;
struct kernfs_open_file *of = s->private;
struct cgroup_pidlist *l = of->priv;

if (l)
mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
CGROUP_PIDLIST_DESTROY_DELAY);
- mutex_unlock(&seq_css(s)->cgroup->pidlist_mutex);
+ mutex_unlock(&cgrp->pidlist_mutex);
+ cgroup_put(cgrp);
}

static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)