Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

From: Seth Forshee
Date: Tue Nov 11 2014 - 10:24:41 EST


On Tue, Nov 11, 2014 at 02:27:34PM +0100, Miklos Szeredi wrote:
> On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
> > If the userspace process servicing fuse requests is running in
> > a pid namespace then pids passed via the fuse fd need to be
> > translated relative to that namespace. Capture the pid namespace
> > in use when the filesystem is mounted and use this for pid
> > translation.
>
> But WHY?
>
> I know it's been discussed, but the reasons should be spelled out here:

The reasons relate much more to user namespaces rather than pid
namespaces, but in my opinion it doesn't make much sense to handle them
differently. It might make more sense to reorder the patches to add user
namespace support first, give the justification there, and then for this
one just say "to be consistent with userns support."

> - capturing the namespace makes the code less complex (the strongest argument
> that I've heard)

For pid namespaces it's not so bad, except maybe for file locking (I
can't remember for sure about that). For user namespaces though it does
get much more complex to use the namespace from the /dev/fuse IO path,
because frequently the raw data from userspace is being handed off to
another thread before it is interpreted. That means we either have to
capture the userns and pass it along with the data or restructure the
code to do all of this in the IO paths. Either way having a static
namespace in fuse_conn works out to be much simpler.

> - there's a security concern with translating to current namespace (that I
> still don't fully understand)

There's the one I explained with suid, which in reality should be
mitigated by some of the other protections in fuse. Other than that I
don't know of anything specific Eric had in mind other than the fact
that other sources of userns security bugs have been "time of open
versus time of use" sorts of problems, and he believes that capturing
the userns at IO time instead of mount time is likely to be similarly
vulnerable.

Otherwise I can't add any specifics beyond what Eric has already said in
http://lkml.kernel.org/g/87egv26mcz.fsf@xxxxxxxxxxxxxxxxxxxxxx Maybe
Eric or Andy could elaborate further. I'll also look at some of the past
vulnerabilities and see if I get any ideas for specific attacks.

> - changing the pid namespace after mounting is unlikely to be of any use

This sounds like trying to prove a negative. All I can say is that no
one has stepped forward to voice any specific use cases which would
require it.

> - if it turns out to be useful, we can still fix this later (can we?)

I don't see why we couldn't as long as nothing in userspace came to rely
upon the behavior.

> What happens when the server does indeed change pid namespace after mounting?
> Will just receive bogus pid values? Shouldn't it receive an error instead?

Yeah, I suppose it does receive bogus pids and userid values. About all
we could do to prevent this is make the /dev/fuse read/write paths
return an error if the current namespace isn't the same as the one at
mount time. But then requests could get stuck in the queue forever, so
maybe we should also fail all requests in the queue when this happens.
Unless you have a better idea?

>
> More comments inline.
>
> > File locking changes based on previous work done by Eric
> > Biederman.
> >
> > Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> > Cc: Serge H. Hallyn <serge.hallyn@xxxxxxxxxx>
> > Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> > Signed-off-by: Seth Forshee <seth.forshee@xxxxxxxxxxxxx>
> > ---
> > fs/fuse/dev.c | 9 +++++----
> > fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
> > fs/fuse/fuse_i.h | 4 ++++
> > fs/fuse/inode.c | 4 ++++
> > 4 files changed, 40 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index ca887314aba9..839caebd34f1 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -20,6 +20,7 @@
> > #include <linux/swap.h>
> > #include <linux/splice.h>
> > #include <linux/aio.h>
> > +#include <linux/sched.h>
> >
> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> > MODULE_ALIAS("devname:fuse");
> > @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
> > atomic_dec(&req->count);
> > }
> >
> > -static void fuse_req_init_context(struct fuse_req *req)
> > +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> > {
> > req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> > req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> > - req->in.h.pid = current->pid;
> > + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> > }
> >
> > static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> > @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> > goto out;
> > }
> >
> > - fuse_req_init_context(req);
> > + fuse_req_init_context(fc, req);
> > req->waiting = 1;
> > req->background = for_background;
> > return req;
> > @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
> > if (!req)
> > req = get_reserved_req(fc, file);
> >
> > - fuse_req_init_context(req);
> > + fuse_req_init_context(fc, req);
> > req->waiting = 1;
> > req->background = 0;
> > return req;
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index caa8d95b24e8..cb0e40ecc362 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
> > return generic_file_mmap(file, vma);
> > }
> >
> > -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > +static int convert_fuse_file_lock(struct fuse_conn *fc,
> > + const struct fuse_file_lock *ffl,
> > struct file_lock *fl)
> > {
> > switch (ffl->type) {
> > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> >
> > fl->fl_start = ffl->start;
> > fl->fl_end = ffl->end;
> > - fl->fl_pid = ffl->pid;
> > + rcu_read_lock();
> > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> > + rcu_read_unlock();
> > + if (ffl->pid != 0 && fl->fl_pid == 0)
> > + return -EIO;
>
> This needs some comments: is this trying to translate the pid backwards? Why is
> it not checking the return of find_pid_ns() then? The man page documents l_pid
> value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
> man page incomplete and in practice we get l_pid == 0 values?

I'll add comments. It's converting the pid in the fuse_file_lock into
the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
the pid can't be translated into the namespace, thus we return the
error.

pid_vnr's return values don't necessarily conform to the expectations of
the fcntl syscall in all cases, and as far as I can tell it should never
return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
return that value then doesn't it makes sense for fuse to return an
error in cases where this would happen?

Fwiw, this is also an example of a case where it's simpler to have a
static namespace. If we were to use the current ns from the /dev/fuse IO
path then we either have to process the request there or grab a
reference to the ns there and pass it alongside the request (and we
have to do it for all requests unless the IO path is going to look at
the request type and decide whether or not it requires a reference to
the namespace).

>
>
> > break;
> >
> > default:
> > @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > return 0;
> > }
> >
> > -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > - const struct file_lock *fl, int opcode, pid_t pid,
> > - int flock)
> > +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
> > + const struct file_lock *fl, int opcode,
> > + struct pid *pid, int flock)
> > {
> > struct inode *inode = file_inode(file);
> > struct fuse_conn *fc = get_fuse_conn(inode);
> > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > arg->lk.start = fl->fl_start;
> > arg->lk.end = fl->fl_end;
> > arg->lk.type = fl->fl_type;
> > - arg->lk.pid = pid;
> > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> > + if (pid && arg->lk.pid == 0)
> > + return -EOVERFLOW;
>
> Could have done the conversion immediately after getting 'pid' with task_tgid(),
> then the changes would have been smaller and more localized.

The changes would be very marginally smaller since currently only one
caller of fuse_lk_fill which passes a non-zero pid. If additional
callers were ever added with non-zero pids then there would be
duplication of this code. But I'll do it whichever way you prefer, just
let me know.

> > if (flock)
> > arg->lk_flags |= FUSE_LK_FLOCK;
> > req->in.h.opcode = opcode;
> > @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > req->in.numargs = 1;
> > req->in.args[0].size = sizeof(*arg);
> > req->in.args[0].value = arg;
> > +
> > + return 0;
> > }
> >
> > static int fuse_getlk(struct file *file, struct file_lock *fl)
> > @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
> > if (IS_ERR(req))
> > return PTR_ERR(req);
> >
> > - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
> > + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
> > + if (err)
> > + goto out;
> > req->out.numargs = 1;
> > req->out.args[0].size = sizeof(outarg);
> > req->out.args[0].value = &outarg;
> > fuse_request_send(fc, req);
> > err = req->out.h.error;
> > - fuse_put_request(fc, req);
> > if (!err)
> > - err = convert_fuse_file_lock(&outarg.lk, fl);
> > + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
> >
> > +out:
> > + fuse_put_request(fc, req);
> > return err;
> > }
> >
> > @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> > struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_req *req;
> > int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
> > - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
> > + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
> > int err;
> >
> > if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> > @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> > if (IS_ERR(req))
> > return PTR_ERR(req);
> >
> > - fuse_lk_fill(req, file, fl, opcode, pid, flock);
> > + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
> > + if (err)
> > + goto out;
> > fuse_request_send(fc, req);
> > err = req->out.h.error;
> > /* locking is restartable */
> > if (err == -EINTR)
> > err = -ERESTARTSYS;
> > +
> > +out:
> > fuse_put_request(fc, req);
> > return err;
> > }
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index e8e47a6ab518..a3ded071e2c6 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -22,6 +22,7 @@
> > #include <linux/rbtree.h>
> > #include <linux/poll.h>
> > #include <linux/workqueue.h>
> > +#include <linux/pid_namespace.h>
> >
> > /** Max number of pages that can be used in a single read request */
> > #define FUSE_MAX_PAGES_PER_REQ 32
> > @@ -386,6 +387,9 @@ struct fuse_conn {
> > /** The group id for this mount */
> > kgid_t group_id;
> >
> > + /** The pid namespace for this mount */
> > + struct pid_namespace *pid_ns;
> > +
> > /** The fuse mount flags for this mount */
> > unsigned flags;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 03246cd9d47a..e137969815a3 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -20,6 +20,7 @@
> > #include <linux/random.h>
> > #include <linux/sched.h>
> > #include <linux/exportfs.h>
> > +#include <linux/pid_namespace.h>
> >
> > MODULE_AUTHOR("Miklos Szeredi <miklos@xxxxxxxxxx>");
> > MODULE_DESCRIPTION("Filesystem in Userspace");
> > @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> > fc->initialized = 0;
> > fc->attr_version = 1;
> > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> > + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> > }
> > EXPORT_SYMBOL_GPL(fuse_conn_init);
> >
> > @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> > if (atomic_dec_and_test(&fc->count)) {
> > if (fc->destroy_req)
> > fuse_request_free(fc->destroy_req);
> > + put_pid_ns(fc->pid_ns);
> > + fc->pid_ns = NULL;
>
> We don't usually zero out fields before freeing. There are debugging config
> options to do that for us.

Okay, I'll remove that line.

Thanks,
Seth

>
> > fc->release(fc);
> > }
> > }
> > --
> > 1.9.1
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/