Re: [PATCH v6] fuse: Add support for passthrough read/write
From: Jann Horn
Date: Wed Aug 19 2020 - 06:04:45 EST
On Tue, Aug 18, 2020 at 3:53 PM Alessio Balsini <balsini@xxxxxxxxxxx> wrote:
> Thank you both for the important feedback,
>
> I tried to consolidate all your suggestions in the new version of the
> patch, shared below.
>
> As you both recommended, that tricky ki_filp swapping has been removed,
> taking overlayfs as a reference for the management of asynchronous
> requests.
> The V7 below went again through some testing with fio (sync and libaio
> engines with several jobs in randrw), fsstress, and a bunch of kernel
> builds to trigger both the sync and async paths. Gladly everything worked
> as expected.
>
> What I didn't get from overlayfs are the temporary cred switchings, as I
> think that in the FUSE use case it is still the FUSE daemon that should
> take care of identifying if the requesting process has the right
> permissions before allowing the passthrough feature.
FYI, rw_verify_area() annoyingly calls security_file_permission(),
which is hooked by some stuff like SELinux. This doesn't really fit
well into the normal Linux security model, and it means that when a
userspace process tries to read from a FUSE file with passthrough
enabled, SELinux will actually perform access checks against *both*
the FUSE file itself and the backing file. That's kinda stupid, but
SELinux does it. If this ends up being a problem, switching
credentials would help. (Changing rw_verify_area() to not call
security_file_permission() would also help, but some of the LSM folks
might disagree with my opinion that that check is silly and should be
removed.)
But for now, I guess it might be fine to leave things as-is and not do
the extra credential switching.
> On Thu, Aug 13, 2020 at 08:30:21PM +0200, 'Jann Horn' via kernel-team wrote:
> > On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@xxxxxxxxxxx> wrote:
> > [...]
> >
> > My point is that you can use this security issue to steal file
> > descriptors from processes that are _not_ supposed to act as FUSE
> > daemons. For example, let's say there is some setuid root executable
> > that opens /etc/shadow and then writes a message to stdout. If I
> > invoke this setuid root executable with stdout pointing to a FUSE
> > device fd, and I can arrange for the message written by the setuid
> > process to look like a FUSE message, I can trick the setuid process to
> > install its /etc/shadow fd as a passthrough fd on a FUSE file, and
> > then, through the FUSE filesystem, mess with /etc/shadow.
> >
> > Also, for context: While on Android, access to /dev/fuse is
> > restricted, desktop Linux systems allow unprivileged users to use
> > FUSE.
> > [...]
> > The new fuse_dev_ioctl() command would behave just like
> > fuse_dev_write(), except that the ioctl-based interface would permit
> > OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
> > interface would reject them.
> > [...]
> > I can't think of any checks that you could do there to make it safe,
> > because fundamentally what you have to figure out is _userspace's
> > intent_, and you can't figure out what that is if userspace just calls
> > write().
> > [...]
> > In this case, an error indicates that the userspace programmer made a
> > mistake. So even if the userspace programmer is not looking at kernel
> > logs, we should indicate to them that they messed up - and we can do
> > that by returning an error code from the syscall. So I think we should
> > ideally abort the operation in this case.
>
> I've been thinking about the security issue you mentioned, but I'm thinking
> that besides the extra Android restrictions, it shouldn't be that easy to
> play with /dev/fuse, also for a privileged process in Linux.
> In order for a process to successfully send commands to /dev/fuse, the
> command has to match a pending request coming from the FUSE filesystem,
> meaning that the same privileged process should have mounted the FUSE
> filesystem, actually becoming a FUSE daemon itself.
> Is this still an eventually exploitable scenario?
There are setuid root binaries that will print attacker-controlled
data to stdout, or stuff like that. An attacker could set up a normal
FUSE filesystem, read an OPEN request from the /dev/fuse file
descriptor, construct a reply that would install a passthrough fd (but
NOT actually send it), and then invoke a setuid root binary with the
/dev/fuse fd as stderr and with some arguments that trick it into
writing the attacker-constructed reply to stderr. For example, you can
make "su" write a string starting with attacker-controlled data like
this:
$ cat blah.c
#include <unistd.h>
#include <err.h>
int main(void) {
execlp("su", "ATTACKER-CONTROLLED STRING", "-BLAH", NULL);
err(1, "execlp");
}
$ gcc -o blah blah.c
$ ./blah
ATTACKER-CONTROLLED STRING: invalid option -- 'B'
Try 'ATTACKER-CONTROLLED STRING --help' for more information.
$
And yes, sure, I know, you can't write nullbytes with this specific
example; but in principle, you have to assume that a privileged victim
process might be writing arbitrary attacker-controlled data to
attacker-supplied file descriptors.
> > No, call_write_iter() doesn't do any of the required checking and
> > setup and notification steps, it just calls into the file's
> > ->write_iter handler:
> >
> > static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
> > struct iov_iter *iter)
> > {
> > return file->f_op->write_iter(kio, iter);
> > }
> >
> > Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
> > something like that?
>
> Ops, you are right, I was referring to vfs_iocb_iter_write(), that was
> actually part of my first design, but ended up simplifying in favor of
> call_write_iter().
>
> > Requests can complete asynchronously. That means call_write_iter() can
> > return more or less immediately, while the request is processed in the
> > background, and at some point, a callback is invoked. That's what that
> > -EIOCBQUEUED return value is about. In that case, the current code
> > will change ->ki_filp while the request is still being handled in the
> > background.
> >
> > I recommend looking at ovl_write_iter() in overlayfs to see how
> > they're dealing with this case.
>
> That was a great suggestion, I hopefully picked the right things from
> overlayfs, without overlooking some security aspects.
[...]
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
[...]
> +static void fuse_copyattr(struct file *dst_file, struct file *src_file,
> + bool write)
> +{
> + struct inode *dst = file_inode(dst_file);
> + struct inode *src = file_inode(src_file);
> +
> + fsstack_copy_attr_times(dst, src);
> + if (write)
> + fsstack_copy_inode_size(dst, src);
> +}
[...]
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> + struct iov_iter *iter)
> +{
> + struct file *fuse_filp = iocb_fuse->ki_filp;
> + struct fuse_file *ff = fuse_filp->private_data;
> + struct file *passthrough_filp = ff->passthrough_filp;
> + ssize_t ret;
> +
> + if (!iov_iter_count(iter))
> + return 0;
> +
> + if (is_sync_kiocb(iocb_fuse)) {
> + struct kiocb iocb;
> +
> + kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
> + ret = call_read_iter(passthrough_filp, &iocb, iter);
> + iocb_fuse->ki_pos = iocb.ki_pos;
> + } else {
> + struct fuse_aio_req *aio_req;
> +
> + aio_req =
> + kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
> + if (!aio_req)
> + return -ENOMEM;
> +
> + aio_req->iocb_fuse = iocb_fuse;
> + kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> + aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> + ret = vfs_iocb_iter_read(passthrough_filp, &aio_req->iocb,
> + iter);
> + if (ret != -EIOCBQUEUED)
> + fuse_aio_cleanup_handler(aio_req);
> + }
> +
> + fuse_copyattr(fuse_filp, passthrough_filp, false);
This copies timestamps from the passthrough_filp to the fuse_filp
without any locking I can see - which means that depending on the
architecture and the compiler's decisions, the copied timestamps can
end up getting corrupted to some degree. On 64-bit platforms, in
practice, you might e.g. end up with a timestamp where the tv_nsec
part is from an older timestamp while tv_sec is from a newer one,
yielding an overall timestamp that might e.g. be almost a second in
the future; and on 32-bit platforms, the results could even be much
worse. In theory, this is a data race, which should generally be
avoided.
The same thing applies to some of the other places where
fuse_copyattr() is used.
>From a higher-level perspective, are you sure that you even want to do
this timestamp-copying at all, given that the userspace fuse daemon
will also be supplying its own timestamps? Especially considering that
the timestamps are at the inode level while the passthrough logic is
at the file level, that seems pretty wonky.
> + return ret;
> +}
[...]
> +int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req)
> +{
> + struct super_block *passthrough_sb;
> + struct inode *passthrough_inode;
> + struct fuse_open_out *open_out;
> + struct file *passthrough_filp;
> + unsigned short open_out_index;
> + int fs_stack_depth;
> +
> + req->passthrough_filp = NULL;
> +
> + if (!fc->passthrough)
> + return 0;
> +
> + if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
> + !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
> + return 0;
> +
> + open_out_index = req->args->out_numargs - 1;
> +
> + if (req->args->out_args[open_out_index].size != sizeof(*open_out))
> + return 0;
Can this ever legitimately happen, or would that indicate a kernel
bug? If it indicates a kernel bug, please write WARN_ON() around the
condition, like this: "if
(WARN_ON(req->args->out_args[open_out_index].size !=
sizeof(*open_out)))".
> + open_out = req->args->out_args[open_out_index].value;
> +
> + if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
> + return 0;
> +
> + passthrough_filp = fget(open_out->fd);
> + if (!passthrough_filp) {
> + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> + return -1;
> + }
> +
> + if (!passthrough_filp->f_op->read_iter ||
> + !passthrough_filp->f_op->write_iter) {
> + pr_err("FUSE: passthrough file misses file operations.\n");
> + fput(passthrough_filp);
> + return -1;
> + }
(Just as a non-actionable comment: This means that passthrough files
won't work with files that use ->read and ->write handlers instead of
->read_iter and ->write_iter. But that's probably actually a good
thing - ->read and ->write are mostly used by weird pseudo-files, and
I don't see any legitimate reason why anyone would want to use those
things with FUSE passthrough.)
> + passthrough_inode = file_inode(passthrough_filp);
> + passthrough_sb = passthrough_inode->i_sb;
> + fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> + fput(passthrough_filp);
> + return -1;
> + }
> +
> + req->passthrough_filp = passthrough_filp;
> + return 0;
> +}
[...]
> +int __init fuse_passthrough_aio_request_cache_init(void)
> +{
> + fuse_aio_request_cachep =
> + kmem_cache_create("fuse_aio_req", sizeof(struct fuse_aio_req),
> + 0, SLAB_HWCACHE_ALIGN, NULL);
> + if (!fuse_aio_request_cachep)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +void fuse_passthrough_aio_request_cache_destroy(void)
> +{
> + kmem_cache_destroy(fuse_aio_request_cachep);
> +}
I think you don't have to make your own slab cache for this; it should
be fine to just use kmalloc() instead. Using a dedicated cache will be
more space-efficient if you have tons of allocations, but you won't
have a huge number of these allocations existing simultaneously, so
that shouldn't matter much. And if you only have a very small number
of allocations, you'll probably actually end up using more memory.