Re: [PATCH][BUGFIX] cgroups: fix pid namespace bug
From: KAMEZAWA Hiroyuki
Date: Wed Jul 01 2009 - 21:59:06 EST
On Wed, 1 Jul 2009 18:36:36 -0700
Paul Menage <menage@xxxxxxxxxx> wrote:
> Thanks Li - but as I said to Serge in the email when I brought this up
> originally, I already had a patch in mind for this; I've had an intern
> (Ben) at Google working on it. His patch (pretty much ready, and being
> sent out tomorrow I hope) is pretty similar to yours, but his is on
> top of another patch that provides a (currently read-only)
> "cgroup.procs" file in each cgroup dir that lists the unique tgids in
> the cgroup. So the key in the list of pid arrays is actually a pid_ns
> and a file type (indicating procs or tasks) rather than just a pid_ns.
>
Wow, I welcome "procs" file :)
Off-topic:
Is there relationship betweem mm->owner and tgid ?
If no, is it difficult to synchronize tgid and mm->owner ?
THanks,
-Kame
> So I think it makes more sense to not use this patch, but to use the
> pair of patches that Ben's written, since they provide more overal
> functionality.
>
> Paul
>
> On Wed, Jul 1, 2009 at 6:24 PM, Li Zefan<lizf@xxxxxxxxxxxxxx> wrote:
> > The bug was introduced by commit cc31edceee04a7b87f2be48f9489ebb72d264844
> > ("cgroups: convert tasks file to use a seq_file with shared pid array").
> >
> > We cache a pid array for all threads that are opening the same "tasks"
> > file, but the pids in the array are always from the namespace of the
> > last process that opened the file, so all other threads will read pids
> > from that namespace instead of their own namespaces.
> >
> > To fix it, we maintain a list of pid arrays, which is keyed by pid_ns.
> > The list will be of length 1 at most time.
> >
> > Reported-by: Paul Menage <menage@xxxxxxxxxx>
> > Idea-by: Paul Menage <menage@xxxxxxxxxx>
> > Signed-off-by: Li Zefan <lizf@xxxxxxxxxxxxxx>
> > ---
> > Âinclude/linux/cgroup.h | Â 11 ++----
> > Âkernel/cgroup.c    Â|  94 +++++++++++++++++++++++++++++++++++------------
> > Â2 files changed, 74 insertions(+), 31 deletions(-)
> >
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 665fa70..20411d2 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -179,14 +179,11 @@ struct cgroup {
> > Â Â Â Â */
> > Â Â Â Âstruct list_head release_list;
> >
> > - Â Â Â /* pids_mutex protects the fields below */
> > + Â Â Â /* pids_mutex protects pids_list and cached pid arrays. */
> > Â Â Â Âstruct rw_semaphore pids_mutex;
> > - Â Â Â /* Array of process ids in the cgroup */
> > - Â Â Â pid_t *tasks_pids;
> > - Â Â Â /* How many files are using the current tasks_pids array */
> > - Â Â Â int pids_use_count;
> > - Â Â Â /* Length of the current tasks_pids array */
> > - Â Â Â int pids_length;
> > +
> > + Â Â Â /* Linked list of struct cgroup_pids */
> > + Â Â Â struct list_head pids_list;
> >
> > Â Â Â Â/* For RCU-protected deletion */
> > Â Â Â Âstruct rcu_head rcu_head;
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 3737a68..13dddb4 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -47,6 +47,7 @@
> > Â#include <linux/hash.h>
> > Â#include <linux/namei.h>
> > Â#include <linux/smp_lock.h>
> > +#include <linux/pid_namespace.h>
> >
> > Â#include <asm/atomic.h>
> >
> > @@ -960,6 +961,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
> > Â Â Â ÂINIT_LIST_HEAD(&cgrp->children);
> > Â Â Â ÂINIT_LIST_HEAD(&cgrp->css_sets);
> > Â Â Â ÂINIT_LIST_HEAD(&cgrp->release_list);
> > + Â Â Â INIT_LIST_HEAD(&cgrp->pids_list);
> > Â Â Â Âinit_rwsem(&cgrp->pids_mutex);
> > Â}
> > Âstatic void init_cgroup_root(struct cgroupfs_root *root)
> > @@ -2201,12 +2203,30 @@ err:
> > Â Â Â Âreturn ret;
> > Â}
> >
> > +/*
> > + * Cache pids for all threads in the same pid namespace that are
> > + * opening the same "tasks" file.
> > + */
> > +struct cgroup_pids {
> > + Â Â Â /* The node in cgrp->pids_list */
> > + Â Â Â struct list_head list;
> > + Â Â Â /* The cgroup those pids belong to */
> > + Â Â Â struct cgroup *cgrp;
> > + Â Â Â /* The namepsace those pids belong to */
> > + Â Â Â struct pid_namespace *pid_ns;
> > + Â Â Â /* Array of process ids in the cgroup */
> > + Â Â Â pid_t *tasks_pids;
> > + Â Â Â /* How many files are using the this tasks_pids array */
> > + Â Â Â int pids_use_count;
> > + Â Â Â /* Length of the current tasks_pids array */
> > + Â Â Â int pids_length;
> > +};
> > +
> > Âstatic int cmppid(const void *a, const void *b)
> > Â{
> > Â Â Â Âreturn *(pid_t *)a - *(pid_t *)b;
> > Â}
> >
> > -
> > Â/*
> > Â* seq_file methods for the "tasks" file. The seq_file position is the
> > Â* next pid to display; the seq_file iterator is a pointer to the pid
> > @@ -2221,45 +2241,47 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
> > Â Â Â Â * after a seek to the start). Use a binary-search to find the
> > Â Â Â Â * next pid to display, if any
> > Â Â Â Â */
> > - Â Â Â struct cgroup *cgrp = s->private;
> > + Â Â Â struct cgroup_pids *cp = s->private;
> > + Â Â Â struct cgroup *cgrp = cp->cgrp;
> > Â Â Â Âint index = 0, pid = *pos;
> > Â Â Â Âint *iter;
> >
> > Â Â Â Âdown_read(&cgrp->pids_mutex);
> > Â Â Â Âif (pid) {
> > - Â Â Â Â Â Â Â int end = cgrp->pids_length;
> > + Â Â Â Â Â Â Â int end = cp->pids_length;
> >
> > Â Â Â Â Â Â Â Âwhile (index < end) {
> > Â Â Â Â Â Â Â Â Â Â Â Âint mid = (index + end) / 2;
> > - Â Â Â Â Â Â Â Â Â Â Â if (cgrp->tasks_pids[mid] == pid) {
> > + Â Â Â Â Â Â Â Â Â Â Â if (cp->tasks_pids[mid] == pid) {
> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âindex = mid;
> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âbreak;
> > - Â Â Â Â Â Â Â Â Â Â Â } else if (cgrp->tasks_pids[mid] <= pid)
> > + Â Â Â Â Â Â Â Â Â Â Â } else if (cp->tasks_pids[mid] <= pid)
> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âindex = mid + 1;
> > Â Â Â Â Â Â Â Â Â Â Â Âelse
> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âend = mid;
> > Â Â Â Â Â Â Â Â}
> > Â Â Â Â}
> > Â Â Â Â/* If we're off the end of the array, we're done */
> > - Â Â Â if (index >= cgrp->pids_length)
> > + Â Â Â if (index >= cp->pids_length)
> > Â Â Â Â Â Â Â Âreturn NULL;
> > Â Â Â Â/* Update the abstract position to be the actual pid that we found */
> > - Â Â Â iter = cgrp->tasks_pids + index;
> > + Â Â Â iter = cp->tasks_pids + index;
> > Â Â Â Â*pos = *iter;
> > Â Â Â Âreturn iter;
> > Â}
> >
> > Âstatic void cgroup_tasks_stop(struct seq_file *s, void *v)
> > Â{
> > - Â Â Â struct cgroup *cgrp = s->private;
> > + Â Â Â struct cgroup_pids *cp = s->private;
> > + Â Â Â struct cgroup *cgrp = cp->cgrp;
> > Â Â Â Âup_read(&cgrp->pids_mutex);
> > Â}
> >
> > Âstatic void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
> > Â{
> > - Â Â Â struct cgroup *cgrp = s->private;
> > + Â Â Â struct cgroup_pids *cp = s->private;
> > Â Â Â Âint *p = v;
> > - Â Â Â int *end = cgrp->tasks_pids + cgrp->pids_length;
> > + Â Â Â int *end = cp->tasks_pids + cp->pids_length;
> >
> > Â Â Â Â/*
> > Â Â Â Â * Advance to the next pid in the array. If this goes off the
> > @@ -2286,26 +2308,32 @@ static struct seq_operations cgroup_tasks_seq_operations = {
> > Â Â Â Â.show = cgroup_tasks_show,
> > Â};
> >
> > -static void release_cgroup_pid_array(struct cgroup *cgrp)
> > +static void release_cgroup_pid_array(struct cgroup_pids *cp)
> > Â{
> > + Â Â Â struct cgroup *cgrp = cp->cgrp;
> > +
> > Â Â Â Âdown_write(&cgrp->pids_mutex);
> > - Â Â Â BUG_ON(!cgrp->pids_use_count);
> > - Â Â Â if (!--cgrp->pids_use_count) {
> > - Â Â Â Â Â Â Â kfree(cgrp->tasks_pids);
> > - Â Â Â Â Â Â Â cgrp->tasks_pids = NULL;
> > - Â Â Â Â Â Â Â cgrp->pids_length = 0;
> > + Â Â Â BUG_ON(!cp->pids_use_count);
> > + Â Â Â if (!--cp->pids_use_count) {
> > + Â Â Â Â Â Â Â list_del(&cp->list);
> > + Â Â Â Â Â Â Â kfree(cp->tasks_pids);
> > + Â Â Â Â Â Â Â kfree(cp);
> > Â Â Â Â}
> > Â Â Â Âup_write(&cgrp->pids_mutex);
> > Â}
> >
> > Âstatic int cgroup_tasks_release(struct inode *inode, struct file *file)
> > Â{
> > - Â Â Â struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
> > + Â Â Â struct seq_file *seq;
> > + Â Â Â struct cgroup_pids *cp;
> >
> > Â Â Â Âif (!(file->f_mode & FMODE_READ))
> > Â Â Â Â Â Â Â Âreturn 0;
> >
> > - Â Â Â release_cgroup_pid_array(cgrp);
> > + Â Â Â seq = file->private_data;
> > + Â Â Â cp = seq->private;
> > +
> > + Â Â Â release_cgroup_pid_array(cp);
> > Â Â Â Âreturn seq_release(inode, file);
> > Â}
> >
> > @@ -2324,6 +2352,8 @@ static struct file_operations cgroup_tasks_operations = {
> > Âstatic int cgroup_tasks_open(struct inode *unused, struct file *file)
> > Â{
> > Â Â Â Âstruct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
> > + Â Â Â struct pid_namespace *pid_ns = task_active_pid_ns(current);
> > + Â Â Â struct cgroup_pids *cp;
> > Â Â Â Âpid_t *pidarray;
> > Â Â Â Âint npids;
> > Â Â Â Âint retval;
> > @@ -2350,20 +2380,36 @@ static int cgroup_tasks_open(struct inode *unused, struct file *file)
> > Â Â Â Â * array if necessary
> > Â Â Â Â */
> > Â Â Â Âdown_write(&cgrp->pids_mutex);
> > - Â Â Â kfree(cgrp->tasks_pids);
> > - Â Â Â cgrp->tasks_pids = pidarray;
> > - Â Â Â cgrp->pids_length = npids;
> > - Â Â Â cgrp->pids_use_count++;
> > +
> > + Â Â Â list_for_each_entry(cp, &cgrp->pids_list, list) {
> > + Â Â Â Â Â Â Â if (pid_ns == cp->pid_ns)
> > + Â Â Â Â Â Â Â Â Â Â Â goto found;
> > + Â Â Â }
> > +
> > + Â Â Â cp = kzalloc(sizeof(*cp), GFP_KERNEL);
> > + Â Â Â if (!cp) {
> > + Â Â Â Â Â Â Â up_write(&cgrp->pids_mutex);
> > + Â Â Â Â Â Â Â kfree(pidarray);
> > + Â Â Â Â Â Â Â return -ENOMEM;
> > + Â Â Â }
> > + Â Â Â cp->cgrp = cgrp;
> > + Â Â Â cp->pid_ns = pid_ns;
> > + Â Â Â list_add(&cp->list, &cgrp->pids_list);
> > +found:
> > + Â Â Â kfree(cp->tasks_pids);
> > + Â Â Â cp->tasks_pids = pidarray;
> > + Â Â Â cp->pids_length = npids;
> > + Â Â Â cp->pids_use_count++;
> > Â Â Â Âup_write(&cgrp->pids_mutex);
> >
> > Â Â Â Âfile->f_op = &cgroup_tasks_operations;
> >
> > Â Â Â Âretval = seq_open(file, &cgroup_tasks_seq_operations);
> > Â Â Â Âif (retval) {
> > - Â Â Â Â Â Â Â release_cgroup_pid_array(cgrp);
> > + Â Â Â Â Â Â Â release_cgroup_pid_array(cp);
> > Â Â Â Â Â Â Â Âreturn retval;
> > Â Â Â Â}
> > - Â Â Â ((struct seq_file *)file->private_data)->private = cgrp;
> > + Â Â Â ((struct seq_file *)file->private_data)->private = cp;
> > Â Â Â Âreturn 0;
> > Â}
> >
> > --
> > 1.5.4.rc3
> >
> _______________________________________________
> Containers mailing list
> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linux-foundation.org/mailman/listinfo/containers
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/