Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4

From: Trond Myklebust

Date: Tue Jun 23 2026 - 00:51:40 EST


On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:

> +cc [linux-xfs@xxxxxxxxxxxxxxx](mailto:linux-xfs@xxxxxxxxxxxxxxx)
>
> On 6/22/26 9:56 PM, Trond Myklebust wrote:
>
> > On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> >
> > > NFSv4 COMMIT compound does not include GETATTR, and
> > > nfs4_commit_done_cb
> > > does not refresh inode attributes. Meanwhile, every WRITE marks
> > > NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
> > >
> > > After COMMIT, i_blocks remains stale until the next stat() triggers a
> > > full revalidation. In writeback-heavy workloads where COMMITs happen
> > > without intervening stat() calls, the cached block count can stay
> > > indefinitely wrong.
> > >
> > > Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> > > the
> > > next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> > > SPACE_USED, fetching the correct value from the server.
> > >
> > > This matches NFSv3 behavior where nfs3_commit_done already calls
> > > nfs_refresh_inode with the wcc_data post-op attributes.
> > >
> > > Reproduce with xfstests generic/694 on NFSv4.0 loopback:
> > >
> > >   Server:
> > >     mount /dev/vdc /data/test
> > >     mount /dev/vdd /data/scratch
> > >     exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> > > 127.0.0.1:/data/test
> > >     exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> > > 127.0.0.1:/data/scratch
> > >
> > >   Client:
> > >     mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> > >     mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
> > >
> > >   local.config:
> > >     export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> > >     export MOUNT_OPTIONS="-o vers=4.0"
> > >     export FSTYP=nfs
> > >     export TEST_DEV=localhost:/data/test
> > >     export SCRATCH_DEV=localhost:/data/scratch
> > >     export TEST_DIR=/mnt/test
> > >     export SCRATCH_MNT=/mnt/scratch
> > >
> > > This fixes xfstests generic/694.
> > >
> > > Assisted-by: Qoder:Qwen3.7-Max
> > > Signed-off-by: Jingbo Xu <[jefflexu@xxxxxxxxxxxxxxxxx](mailto:jefflexu@xxxxxxxxxxxxxxxxx)>
> > > ---
> > >  fs/nfs/write.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index d7c399763ad9..88c5c9f7434c 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> > > nfs_commit_data *data)
> > >   /* Latency breaker */
> > >   cond_resched();
> > >   }
> > > + if (status >= 0)
> > > + nfs_set_cache_invalid(data->inode,
> > > NFS_INO_INVALID_BLOCKS);
> > >  
> > >   nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> > >   nfs_commit_end(cinfo.mds);
> >
> >
> > That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> > data contents of the file: it is just ensuring that the existing data
> > is persisted onto disk.
> >
>
>
> Yes the underlying backend filesystem is indeed xfs.
>
> I retest it and can confirm that this failure is much likely
> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
> consecutive test runs.
>
> Beside XFS itself also passes the test.
>
>
> To be honest I'm not familiar with NFS, following is what AI concludes:
>
> ```
> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
>
>
>    Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
> client updates inode->i_blocks from the last completed WRITE's post-op
> attributes.
>
>    Server-side delayed allocation — the export's local filesystem
> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
> extent tree nodes) are not yet allocated during most WRITE handling;
> they materialize only when the local fs performs its own writeback.
>
>    Last WRITE completes before server metadata writeback (~80%
> probability in user's environment) — the final GETATTR returns
> SPACE_USED = 8388608 (data only, no metadata block). Client caches
> i_blocks = 8388608.
>
>    syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
> __nfs_commit_inode, which sends a COMMIT RPC to the server.
>    Server executes vfs_fsync_range — this forces the local fs writeback,
> which now allocates the metadata block. Server's true SPACE_USED becomes
> 8388616 (+8 sectors = 4 KiB).
>
>    COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
> COMMIT4resok contains only a write verifier. The XDR decoder
> (nfs4_xdr_dec_commit) never calls decode_getfattr.
>
>    nfs_commit_release_pages performs no attribute invalidation — it
> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
> stale cached value persists.
>
>    Subsequent stat returns stale i_blocks — cache_validity is clean, so
> nfs_getattr serves the cached value 8388608 without revalidation.
>
>    After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
> test fails.
> ```
>


I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.

My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.

So strong NACK to this patch from me.

```
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@xxxxxxxxxx, trond.myklebust@xxxxxxxxxxxxxxx

```