Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4

From: Jingbo Xu

Date: Tue Jun 23 2026 - 00:56:24 EST




On 6/23/26 12:51 PM, Trond Myklebust wrote:
> On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:
>
>> +cc [linux-xfs@xxxxxxxxxxxxxxx](mailto:linux-xfs@xxxxxxxxxxxxxxx)
>>
>> On 6/22/26 9:56 PM, Trond Myklebust wrote:
>>
>>> On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
>>>
>>>> NFSv4 COMMIT compound does not include GETATTR, and
>>>> nfs4_commit_done_cb
>>>> does not refresh inode attributes. Meanwhile, every WRITE marks
>>>> NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
>>>>
>>>> After COMMIT, i_blocks remains stale until the next stat() triggers a
>>>> full revalidation. In writeback-heavy workloads where COMMITs happen
>>>> without intervening stat() calls, the cached block count can stay
>>>> indefinitely wrong.
>>>>
>>>> Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
>>>> the
>>>> next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
>>>> SPACE_USED, fetching the correct value from the server.
>>>>
>>>> This matches NFSv3 behavior where nfs3_commit_done already calls
>>>> nfs_refresh_inode with the wcc_data post-op attributes.
>>>>
>>>> Reproduce with xfstests generic/694 on NFSv4.0 loopback:
>>>>
>>>>   Server:
>>>>     mount /dev/vdc /data/test
>>>>     mount /dev/vdd /data/scratch
>>>>     exportfs -o insecure,rw,sync,no_root_squash,fsid=1
>>>> 127.0.0.1:/data/test
>>>>     exportfs -o insecure,rw,sync,no_root_squash,fsid=2
>>>> 127.0.0.1:/data/scratch
>>>>
>>>>   Client:
>>>>     mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
>>>>     mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
>>>>
>>>>   local.config:
>>>>     export TEST_FS_MOUNT_OPTS="-o vers=4.0"
>>>>     export MOUNT_OPTIONS="-o vers=4.0"
>>>>     export FSTYP=nfs
>>>>     export TEST_DEV=localhost:/data/test
>>>>     export SCRATCH_DEV=localhost:/data/scratch
>>>>     export TEST_DIR=/mnt/test
>>>>     export SCRATCH_MNT=/mnt/scratch
>>>>
>>>> This fixes xfstests generic/694.
>>>>
>>>> Assisted-by: Qoder:Qwen3.7-Max
>>>> Signed-off-by: Jingbo Xu <[jefflexu@xxxxxxxxxxxxxxxxx](mailto:jefflexu@xxxxxxxxxxxxxxxxx)>
>>>> ---
>>>>  fs/nfs/write.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>>>> index d7c399763ad9..88c5c9f7434c 100644
>>>> --- a/fs/nfs/write.c
>>>> +++ b/fs/nfs/write.c
>>>> @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
>>>> nfs_commit_data *data)
>>>>   /* Latency breaker */
>>>>   cond_resched();
>>>>   }
>>>> + if (status >= 0)
>>>> + nfs_set_cache_invalid(data->inode,
>>>> NFS_INO_INVALID_BLOCKS);
>>>>  
>>>>   nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>>>>   nfs_commit_end(cinfo.mds);
>>>
>>>
>>> That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
>>> data contents of the file: it is just ensuring that the existing data
>>> is persisted onto disk.
>>>
>>
>>
>> Yes the underlying backend filesystem is indeed xfs.
>>
>> I retest it and can confirm that this failure is much likely
>> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
>> consecutive test runs.
>>
>> Beside XFS itself also passes the test.
>>
>>
>> To be honest I'm not familiar with NFS, following is what AI concludes:
>>
>> ```
>> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
>>
>>
>>    Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
>> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
>> client updates inode->i_blocks from the last completed WRITE's post-op
>> attributes.
>>
>>    Server-side delayed allocation — the export's local filesystem
>> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
>> extent tree nodes) are not yet allocated during most WRITE handling;
>> they materialize only when the local fs performs its own writeback.
>>
>>    Last WRITE completes before server metadata writeback (~80%
>> probability in user's environment) — the final GETATTR returns
>> SPACE_USED = 8388608 (data only, no metadata block). Client caches
>> i_blocks = 8388608.
>>
>>    syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
>> __nfs_commit_inode, which sends a COMMIT RPC to the server.
>>    Server executes vfs_fsync_range — this forces the local fs writeback,
>> which now allocates the metadata block. Server's true SPACE_USED becomes
>> 8388616 (+8 sectors = 4 KiB).
>>
>>    COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
>> COMMIT4resok contains only a write verifier. The XDR decoder
>> (nfs4_xdr_dec_commit) never calls decode_getfattr.
>>
>>    nfs_commit_release_pages performs no attribute invalidation — it
>> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
>> stale cached value persists.
>>
>>    Subsequent stat returns stale i_blocks — cache_validity is clean, so
>> nfs_getattr serves the cached value 8388608 without revalidation.
>>
>>    After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
>> test fails.
>> ```
>>
>
>
> I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.
>
> My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.
>

Understand. Just have no idea which is the proper way to fix this.


--
Thanks,
Jingbo