Re: [PATCH v5 1/4] fuse: Add support for pid namespaces
From: Andy Lutomirski
Date: Tue Nov 11 2014 - 10:39:37 EST
On Tue, Nov 11, 2014 at 7:24 AM, Seth Forshee
<seth.forshee@xxxxxxxxxxxxx> wrote:
> On Tue, Nov 11, 2014 at 02:27:34PM +0100, Miklos Szeredi wrote:
>> On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
>> > If the userspace process servicing fuse requests is running in
>> > a pid namespace then pids passed via the fuse fd need to be
>> > translated relative to that namespace. Capture the pid namespace
>> > in use when the filesystem is mounted and use this for pid
>> > translation.
>>
>> But WHY?
>>
>> I know it's been discussed, but the reasons should be spelled out here:
>
> The reasons relate much more to user namespaces rather than pid
> namespaces, but in my opinion it doesn't make much sense to handle them
> differently. It might make more sense to reorder the patches to add user
> namespace support first, give the justification there, and then for this
> one just say "to be consistent with userns support."
>
>> - capturing the namespace makes the code less complex (the strongest argument
>> that I've heard)
>
> For pid namespaces it's not so bad, except maybe for file locking (I
> can't remember for sure about that). For user namespaces though it does
> get much more complex to use the namespace from the /dev/fuse IO path,
> because frequently the raw data from userspace is being handed off to
> another thread before it is interpreted. That means we either have to
> capture the userns and pass it along with the data or restructure the
> code to do all of this in the IO paths. Either way having a static
> namespace in fuse_conn works out to be much simpler.
>
>> - there's a security concern with translating to current namespace (that I
>> still don't fully understand)
>
> There's the one I explained with suid, which in reality should be
> mitigated by some of the other protections in fuse. Other than that I
> don't know of anything specific Eric had in mind other than the fact
> that other sources of userns security bugs have been "time of open
> versus time of use" sorts of problems, and he believes that capturing
> the userns at IO time instead of mount time is likely to be similarly
> vulnerable.
>
The general concern is that read(2) and write(2) really have no
business looking at creds. ioctl is considerably less dangerous. The
risk is that you get an fd and use at as stderr for a setuid process
or something along those lines.
> Otherwise I can't add any specifics beyond what Eric has already said in
> http://lkml.kernel.org/g/87egv26mcz.fsf@xxxxxxxxxxxxxxxxxxxxxx Maybe
> Eric or Andy could elaborate further. I'll also look at some of the past
> vulnerabilities and see if I get any ideas for specific attacks.
>
>> - changing the pid namespace after mounting is unlikely to be of any use
>
> This sounds like trying to prove a negative. All I can say is that no
> one has stepped forward to voice any specific use cases which would
> require it.
>
>> - if it turns out to be useful, we can still fix this later (can we?)
>
> I don't see why we couldn't as long as nothing in userspace came to rely
> upon the behavior.
Just to be clear, the current behavior is to reject /dev/fuse io
attempts with namespaces that don't match the opener, right? If so,
that seems like the conservative option, and changing it won't break
things.
>
>> What happens when the server does indeed change pid namespace after mounting?
>> Will just receive bogus pid values? Shouldn't it receive an error instead?
>
> Yeah, I suppose it does receive bogus pids and userid values. About all
> we could do to prevent this is make the /dev/fuse read/write paths
> return an error if the current namespace isn't the same as the one at
> mount time. But then requests could get stuck in the queue forever, so
> maybe we should also fail all requests in the queue when this happens.
> Unless you have a better idea?
>
If this is actually read(2)/write(2), then I think you need to either
error out or use the namespaces at the time of open. Anything else is
asking for trouble when the caller of read(2) or write(2) doesn't
realize that the fd is a /dev/fuse fd.
--Andy
>>
>> More comments inline.
>>
>> > File locking changes based on previous work done by Eric
>> > Biederman.
>> >
>> > Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
>> > Cc: Serge H. Hallyn <serge.hallyn@xxxxxxxxxx>
>> > Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> > Signed-off-by: Seth Forshee <seth.forshee@xxxxxxxxxxxxx>
>> > ---
>> > fs/fuse/dev.c | 9 +++++----
>> > fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
>> > fs/fuse/fuse_i.h | 4 ++++
>> > fs/fuse/inode.c | 4 ++++
>> > 4 files changed, 40 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> > index ca887314aba9..839caebd34f1 100644
>> > --- a/fs/fuse/dev.c
>> > +++ b/fs/fuse/dev.c
>> > @@ -20,6 +20,7 @@
>> > #include <linux/swap.h>
>> > #include <linux/splice.h>
>> > #include <linux/aio.h>
>> > +#include <linux/sched.h>
>> >
>> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>> > MODULE_ALIAS("devname:fuse");
>> > @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
>> > atomic_dec(&req->count);
>> > }
>> >
>> > -static void fuse_req_init_context(struct fuse_req *req)
>> > +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> > {
>> > req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> > req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> > - req->in.h.pid = current->pid;
>> > + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> > }
>> >
>> > static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
>> > @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>> > goto out;
>> > }
>> >
>> > - fuse_req_init_context(req);
>> > + fuse_req_init_context(fc, req);
>> > req->waiting = 1;
>> > req->background = for_background;
>> > return req;
>> > @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>> > if (!req)
>> > req = get_reserved_req(fc, file);
>> >
>> > - fuse_req_init_context(req);
>> > + fuse_req_init_context(fc, req);
>> > req->waiting = 1;
>> > req->background = 0;
>> > return req;
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index caa8d95b24e8..cb0e40ecc362 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
>> > return generic_file_mmap(file, vma);
>> > }
>> >
>> > -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> > +static int convert_fuse_file_lock(struct fuse_conn *fc,
>> > + const struct fuse_file_lock *ffl,
>> > struct file_lock *fl)
>> > {
>> > switch (ffl->type) {
>> > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> >
>> > fl->fl_start = ffl->start;
>> > fl->fl_end = ffl->end;
>> > - fl->fl_pid = ffl->pid;
>> > + rcu_read_lock();
>> > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
>> > + rcu_read_unlock();
>> > + if (ffl->pid != 0 && fl->fl_pid == 0)
>> > + return -EIO;
>>
>> This needs some comments: is this trying to translate the pid backwards? Why is
>> it not checking the return of find_pid_ns() then? The man page documents l_pid
>> value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
>> man page incomplete and in practice we get l_pid == 0 values?
>
> I'll add comments. It's converting the pid in the fuse_file_lock into
> the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
> the pid can't be translated into the namespace, thus we return the
> error.
>
> pid_vnr's return values don't necessarily conform to the expectations of
> the fcntl syscall in all cases, and as far as I can tell it should never
> return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
> return that value then doesn't it makes sense for fuse to return an
> error in cases where this would happen?
>
> Fwiw, this is also an example of a case where it's simpler to have a
> static namespace. If we were to use the current ns from the /dev/fuse IO
> path then we either have to process the request there or grab a
> reference to the ns there and pass it alongside the request (and we
> have to do it for all requests unless the IO path is going to look at
> the request type and decide whether or not it requires a reference to
> the namespace).
>
>>
>>
>> > break;
>> >
>> > default:
>> > @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> > return 0;
>> > }
>> >
>> > -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > - const struct file_lock *fl, int opcode, pid_t pid,
>> > - int flock)
>> > +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > + const struct file_lock *fl, int opcode,
>> > + struct pid *pid, int flock)
>> > {
>> > struct inode *inode = file_inode(file);
>> > struct fuse_conn *fc = get_fuse_conn(inode);
>> > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > arg->lk.start = fl->fl_start;
>> > arg->lk.end = fl->fl_end;
>> > arg->lk.type = fl->fl_type;
>> > - arg->lk.pid = pid;
>> > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
>> > + if (pid && arg->lk.pid == 0)
>> > + return -EOVERFLOW;
>>
>> Could have done the conversion immediately after getting 'pid' with task_tgid(),
>> then the changes would have been smaller and more localized.
>
> The changes would be very marginally smaller since currently only one
> caller of fuse_lk_fill which passes a non-zero pid. If additional
> callers were ever added with non-zero pids then there would be
> duplication of this code. But I'll do it whichever way you prefer, just
> let me know.
>
>> > if (flock)
>> > arg->lk_flags |= FUSE_LK_FLOCK;
>> > req->in.h.opcode = opcode;
>> > @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > req->in.numargs = 1;
>> > req->in.args[0].size = sizeof(*arg);
>> > req->in.args[0].value = arg;
>> > +
>> > + return 0;
>> > }
>> >
>> > static int fuse_getlk(struct file *file, struct file_lock *fl)
>> > @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
>> > if (IS_ERR(req))
>> > return PTR_ERR(req);
>> >
>> > - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
>> > + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
>> > + if (err)
>> > + goto out;
>> > req->out.numargs = 1;
>> > req->out.args[0].size = sizeof(outarg);
>> > req->out.args[0].value = &outarg;
>> > fuse_request_send(fc, req);
>> > err = req->out.h.error;
>> > - fuse_put_request(fc, req);
>> > if (!err)
>> > - err = convert_fuse_file_lock(&outarg.lk, fl);
>> > + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
>> >
>> > +out:
>> > + fuse_put_request(fc, req);
>> > return err;
>> > }
>> >
>> > @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
>> > struct fuse_conn *fc = get_fuse_conn(inode);
>> > struct fuse_req *req;
>> > int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
>> > - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
>> > + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
>> > int err;
>> >
>> > if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
>> > @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
>> > if (IS_ERR(req))
>> > return PTR_ERR(req);
>> >
>> > - fuse_lk_fill(req, file, fl, opcode, pid, flock);
>> > + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
>> > + if (err)
>> > + goto out;
>> > fuse_request_send(fc, req);
>> > err = req->out.h.error;
>> > /* locking is restartable */
>> > if (err == -EINTR)
>> > err = -ERESTARTSYS;
>> > +
>> > +out:
>> > fuse_put_request(fc, req);
>> > return err;
>> > }
>> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> > index e8e47a6ab518..a3ded071e2c6 100644
>> > --- a/fs/fuse/fuse_i.h
>> > +++ b/fs/fuse/fuse_i.h
>> > @@ -22,6 +22,7 @@
>> > #include <linux/rbtree.h>
>> > #include <linux/poll.h>
>> > #include <linux/workqueue.h>
>> > +#include <linux/pid_namespace.h>
>> >
>> > /** Max number of pages that can be used in a single read request */
>> > #define FUSE_MAX_PAGES_PER_REQ 32
>> > @@ -386,6 +387,9 @@ struct fuse_conn {
>> > /** The group id for this mount */
>> > kgid_t group_id;
>> >
>> > + /** The pid namespace for this mount */
>> > + struct pid_namespace *pid_ns;
>> > +
>> > /** The fuse mount flags for this mount */
>> > unsigned flags;
>> >
>> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> > index 03246cd9d47a..e137969815a3 100644
>> > --- a/fs/fuse/inode.c
>> > +++ b/fs/fuse/inode.c
>> > @@ -20,6 +20,7 @@
>> > #include <linux/random.h>
>> > #include <linux/sched.h>
>> > #include <linux/exportfs.h>
>> > +#include <linux/pid_namespace.h>
>> >
>> > MODULE_AUTHOR("Miklos Szeredi <miklos@xxxxxxxxxx>");
>> > MODULE_DESCRIPTION("Filesystem in Userspace");
>> > @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>> > fc->initialized = 0;
>> > fc->attr_version = 1;
>> > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>> > + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> > }
>> > EXPORT_SYMBOL_GPL(fuse_conn_init);
>> >
>> > @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
>> > if (atomic_dec_and_test(&fc->count)) {
>> > if (fc->destroy_req)
>> > fuse_request_free(fc->destroy_req);
>> > + put_pid_ns(fc->pid_ns);
>> > + fc->pid_ns = NULL;
>>
>> We don't usually zero out fields before freeing. There are debugging config
>> options to do that for us.
>
> Okay, I'll remove that line.
>
> Thanks,
> Seth
>
>>
>> > fc->release(fc);
>> > }
>> > }
>> > --
>> > 1.9.1
>> >
--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/