Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write
From: Chuck Lever
Date: Fri May 29 2026 - 13:20:26 EST
On 5/29/26 1:06 PM, Jeff Layton wrote:
> On Fri, 2026-05-29 at 13:03 -0400, Chuck Lever wrote:
>> On 5/29/26 1:01 PM, Jeff Layton wrote:
>>> On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
>>>>
>>>> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
>>>>> From: Chris Mason <clm@xxxxxxxx>
>>>>>
>>>>> nfsd_direct_write() walks a list of write segments and, after each
>>>>> vfs_iocb_iter_write(), tries to detect a short write so the loop can
>>>>> stop before placing the next segment at a wrong file offset:
>>>>>
>>>>> host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>>> if (host_err < 0)
>>>>> return host_err;
>>>>> *cnt += host_err;
>>>>> if (host_err < segments[i].iter.count)
>>>>> break; /* partial write */
>>>>>
>>>>> vfs_iocb_iter_write() runs the iter through ->write_iter(), which
>>>>> advances the iter by the number of bytes written. By the time the
>>>>> check runs, segments[i].iter.count is the residual, not the original
>>>>> request length:
>>>>>
>>>>> before write_iter: iter.count == original_len
>>>>> after write_iter: iter.count == original_len - host_err
>>>>>
>>>>> The condition then reduces to host_err < original_len - host_err, so
>>>>> the break fires only when less than half of the segment was written.
>>>>> Any short write completing between 50% and 99% of the segment slips
>>>>> through; the loop advances to the next segment with kiocb->ki_pos
>>>>> only bumped by the short amount, writing the next segment's payload
>>>>> at the wrong offset and over-reporting *cnt to the NFS client.
>>>>>
>>>>> Snapshot the segment's byte count before the write and compare
>>>>> host_err against that snapshot so any short write breaks the loop.
>>>>>
>>>>> Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
>>>>> Assisted-by: kres:claude-opus-4-7
>>>>> Signed-off-by: Chris Mason <clm@xxxxxxxx>
>>>>> ---
>>>>> fs/nfsd/vfs.c | 5 ++++-
>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>>> index 980217f755b7..619f252af4d1 100644
>>>>> --- a/fs/nfsd/vfs.c
>>>>> +++ b/fs/nfsd/vfs.c
>>>>> @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct
>>>>> svc_fh *fhp,
>>>>> struct file *file = nf->nf_file;
>>>>> unsigned int nsegs, i;
>>>>> ssize_t host_err;
>>>>> + size_t expected;
>>>>>
>>>>> nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
>>>>> kiocb, *cnt, segments);
>>>>> @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp,
>>>>> struct svc_fh *fhp,
>>>>> kiocb->ki_flags |= IOCB_DONTCACHE;
>>>>> }
>>>>>
>>>>> + expected = iov_iter_count(&segments[i].iter);
>>>>> +
>>>>> host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
>>>>> if (host_err < 0)
>>>>> return host_err;
>>>>> *cnt += host_err;
>>>>> - if (host_err < segments[i].iter.count)
>>>>> + if (host_err < (ssize_t)expected)
>>>>> break; /* partial write */
>>>>> }
>>>>>
>>>>>
>>>>> --
>>>>> 2.54.0
>>>>
>>>> How many filesystems can return a short write in this case?
>>>> My impression was that only the NFS client can do that.
>>>>
>>>
>>> No idea right offhand, but NFS is exportable. Since
>>> vfs_iocb_iter_write() is allowed to return a short write, I think we
>>> have to deal with that properly here.
>>
>> NFSD_IO_DIRECT is experimental, and doesn't make sense (to me)
>> to use with an NFS re-export.
>>
>> If we can find another filesystem that might return a short write
>> with NFSD_IO_DIRECT, I might consider this a higher priority.
>>
>
> I'm fairly sure Ceph and CIFS can and they're exportable. For local
> filesystems, I'm not even sure how to audit that.
>
> Given that the potential effect is data corruption, omitting this patch
> based on guesswork about what filesystems are being exported seems
> unwise.
The guesswork for me is that it's very unclear how likely a short write
result might be and under what circumstances.
In any event, I think this one needs testing to confirm that the logic
change doesn't introduce a regression in known working use cases.
--
Chuck Lever