Re: [PATCH v4 5/5] virtiofs: propagate sync() to file server

From: Vivek Goyal
Date: Mon Aug 30 2021 - 13:01:34 EST


On Sat, Aug 28, 2021 at 05:21:39PM +0200, Miklos Szeredi wrote:
> On Mon, Aug 16, 2021 at 11:29:02AM -0400, Vivek Goyal wrote:
> > On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote:
>
> > > I wonder - even if the server does not support SYNCFS or if the kernel
> > > does not trust the server with SYNCFS, fuse_sync_fs() can wait
> > > until all pending requests up to this call have been completed, either
> > > before or after submitting the SYNCFS request. No?
> >
> > >
> > > Does virtiofsd track all requests prior to SYNCFS request to make
> > > sure that they were executed on the host filesystem before calling
> > > syncfs() on the host filesystem?
> >
> > Hi Amir,
> >
> > I don't think virtiofsd has any such notion. I would think, that
> > client should make sure all pending writes have completed and
> > then send SYNCFS request.
> >
> > Looking at the sync_filesystem(), I am assuming vfs will take care
> > of flushing out all dirty pages and then call ->sync_fs.
> >
> > Having said that, I think fuse queues the writeback request internally
> > and signals completion of writeback to mm(end_page_writeback()). And
> > that's why fuse_fsync() has notion of waiting for all pending
> > writes to finish on an inode (fuse_sync_writes()).
> >
> > So I think you have raised a good point. That is if there are pending
> > writes at the time of syncfs(), we don't seem to have a notion of
> > first waiting for all these writes to finish before we send
> > FUSE_SYNCFS request to server.
>
> So here a proposed patch for fixing this. Works by counting write requests
> initiated up till the syncfs call. Since more than one syncfs can be in
> progress counts are kept in "buckets" in order to wait for the correct write
> requests in each instance.
>
> I tried to make this lightweight, but the cacheline bounce due to the counter is
> still there, unfortunately. fc->num_waiting also causes cacheline bouce, so I'm
> not going to optimize this (percpu counter?) until that one is also optimizied.
>
> Not yet tested, and I'm not sure how to test this.
>
> Comments?
>
> Thanks,
> Miklos
>
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 97f860cfc195..8d1d6e895534 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -389,6 +389,7 @@ struct fuse_writepage_args {
> struct list_head queue_entry;
> struct fuse_writepage_args *next;
> struct inode *inode;
> + struct fuse_sync_bucket *bucket;
> };
>
> static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> @@ -1608,6 +1609,9 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
> struct fuse_args_pages *ap = &wpa->ia.ap;
> int i;
>
> + if (wpa->bucket && atomic_dec_and_test(&wpa->bucket->num_writepages))

Hi Miklos,

Wondering why this wpa->bucket check is there. Isn't every wpa is associated
bucket. So when do we run into situation when wpa->bucket = NULL.


> + wake_up(&wpa->bucket->waitq);
> +
> for (i = 0; i < ap->num_pages; i++)
> __free_page(ap->pages[i]);
>
> @@ -1871,6 +1875,19 @@ static struct fuse_writepage_args *fuse_writepage_args_alloc(void)
>
> }
>
> +static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
> + struct fuse_writepage_args *wpa)
> +{
> + if (!fc->sync_fs)
> + return;
> +
> + rcu_read_lock();
> + do {
> + wpa->bucket = rcu_dereference(fc->curr_bucket);
> + } while (unlikely(!atomic_inc_not_zero(&wpa->bucket->num_writepages)));

So this loop is there because fuse_sync_fs() might be replacing
fc->curr_bucket. And we are fetching this pointer under rcu. So it is
possible that fuse_fs_sync() dropped its reference and that led to
->num_writepages 0 and we don't want to use this bucket.

What if fuse_sync_fs() dropped its reference but still there is another
wpa in progress and hence ->num_writepages is not zero. We still don't
want to use this bucket for new wpa, right?

> + rcu_read_unlock();
> +}
> +
> static int fuse_writepage_locked(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> @@ -1898,6 +1915,7 @@ static int fuse_writepage_locked(struct page *page)
> if (!wpa->ia.ff)
> goto err_nofile;
>
> + fuse_writepage_add_to_bucket(fc, wpa);
> fuse_write_args_fill(&wpa->ia, wpa->ia.ff, page_offset(page), 0);
>
> copy_highpage(tmp_page, page);
> @@ -2148,6 +2166,8 @@ static int fuse_writepages_fill(struct page *page,
> __free_page(tmp_page);
> goto out_unlock;
> }
> + fuse_writepage_add_to_bucket(fc, wpa);
> +
> data->max_pages = 1;
>
> ap = &wpa->ia.ap;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 07829ce78695..ee638e227bb3 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -515,6 +515,14 @@ struct fuse_fs_context {
> void **fudptr;
> };
>
> +struct fuse_sync_bucket {
> + atomic_t num_writepages;
> + union {
> + wait_queue_head_t waitq;
> + struct rcu_head rcu;
> + };
> +};
> +
> /**
> * A Fuse connection.
> *
> @@ -807,6 +815,9 @@ struct fuse_conn {
>
> /** List of filesystems using this connection */
> struct list_head mounts;
> +
> + /* New writepages go into this bucket */
> + struct fuse_sync_bucket *curr_bucket;
> };
>
> /*
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b9beb39a4a18..524b2d128985 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -506,10 +506,24 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
> return err;
> }
>
> +static struct fuse_sync_bucket *fuse_sync_bucket_alloc(void)
> +{
> + struct fuse_sync_bucket *bucket;
> +
> + bucket = kzalloc(sizeof(*bucket), GFP_KERNEL | __GFP_NOFAIL);
> + if (bucket) {
> + init_waitqueue_head(&bucket->waitq);
> + /* Initial active count */
> + atomic_set(&bucket->num_writepages, 1);
> + }
> + return bucket;
> +}
> +
> static int fuse_sync_fs(struct super_block *sb, int wait)
> {
> struct fuse_mount *fm = get_fuse_mount_super(sb);
> struct fuse_conn *fc = fm->fc;
> + struct fuse_sync_bucket *bucket, *new_bucket;
> struct fuse_syncfs_in inarg;
> FUSE_ARGS(args);
> int err;
> @@ -528,6 +542,31 @@ static int fuse_sync_fs(struct super_block *sb, int wait)
> if (!fc->sync_fs)
> return 0;
>
> + new_bucket = fuse_sync_bucket_alloc();
> + spin_lock(&fc->lock);
> + bucket = fc->curr_bucket;
> + if (atomic_read(&bucket->num_writepages) != 0) {
> + /* One more for count completion of old bucket */
> + atomic_inc(&new_bucket->num_writepages);
> + rcu_assign_pointer(fc->curr_bucket, new_bucket);
> + /* Drop initially added active count */
> + atomic_dec(&bucket->num_writepages);
> + spin_unlock(&fc->lock);
> +
> + wait_event(bucket->waitq, atomic_read(&bucket->num_writepages) == 0);
> + /*
> + * Drop count on new bucket, possibly resulting in a completion
> + * if more than one syncfs is going on
> + */
> + if (atomic_dec_and_test(&new_bucket->num_writepages))
> + wake_up(&new_bucket->waitq);
> + kfree_rcu(bucket, rcu);
> + } else {
> + spin_unlock(&fc->lock);
> + /* Free unused */
> + kfree(new_bucket);
When can we run into the situation when fc->curr_bucket is num_writepages
== 0. When install a bucket it has count 1. And only time it can go to
0 is when we have dropped the initial reference. And initial reference
can be dropped only after removing bucket from fc->curr_bucket.

IOW, we don't drop initial reference on a bucket if it is in
fc->curr_bucket. And that mean anything installed fc->curr_bucket should
not ever have a reference count of 0. What am I missing.

Thanks
Vivek

> + }

> +
> memset(&inarg, 0, sizeof(inarg));
> args.in_numargs = 1;
> args.in_args[0].size = sizeof(inarg);
> @@ -770,6 +809,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> fiq->ops->release(fiq);
> put_pid_ns(fc->pid_ns);
> put_user_ns(fc->user_ns);
> + kfree_rcu(fc->curr_bucket, rcu);
> fc->release(fc);
> }
> }
> @@ -1418,6 +1458,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> if (sb->s_flags & SB_MANDLOCK)
> goto err;
>
> + fc->curr_bucket = fuse_sync_bucket_alloc();
> fuse_sb_defaults(sb);
>
> if (ctx->is_bdev) {
>