Re: [PATCH 08/10] nfsd: fix partial-write detection in nfsd_direct_write

From: Jeff Layton

Date: Fri May 29 2026 - 13:09:02 EST

On Fri, 2026-05-29 at 13:03 -0400, Chuck Lever wrote:
> On 5/29/26 1:01 PM, Jeff Layton wrote:
> > On Fri, 2026-05-29 at 12:57 -0400, Chuck Lever wrote:
> > >
> > > On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > > > From: Chris Mason <clm@xxxxxxxx>
> > > >
> > > > nfsd_direct_write() walks a list of write segments and, after each
> > > > vfs_iocb_iter_write(), tries to detect a short write so the loop can
> > > > stop before placing the next segment at a wrong file offset:
> > > >
> > > > host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> > > > if (host_err < 0)
> > > > return host_err;
> > > > *cnt += host_err;
> > > > if (host_err < segments[i].iter.count)
> > > > break; /* partial write */
> > > >
> > > > vfs_iocb_iter_write() runs the iter through ->write_iter(), which
> > > > advances the iter by the number of bytes written. By the time the
> > > > check runs, segments[i].iter.count is the residual, not the original
> > > > request length:
> > > >
> > > > before write_iter: iter.count == original_len
> > > > after write_iter: iter.count == original_len - host_err
> > > >
> > > > The condition then reduces to host_err < original_len - host_err, so
> > > > the break fires only when less than half of the segment was written.
> > > > Any short write completing between 50% and 99% of the segment slips
> > > > through; the loop advances to the next segment with kiocb->ki_pos
> > > > only bumped by the short amount, writing the next segment's payload
> > > > at the wrong offset and over-reporting *cnt to the NFS client.
> > > >
> > > > Snapshot the segment's byte count before the write and compare
> > > > host_err against that snapshot so any short write breaks the loop.
> > > >
> > > > Fixes: 06c5c97293e3 ("NFSD: Implement NFSD_IO_DIRECT for NFS WRITE")
> > > > Assisted-by: kres:claude-opus-4-7
> > > > Signed-off-by: Chris Mason <clm@xxxxxxxx>
> > > > ---
> > > > fs/nfsd/vfs.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > index 980217f755b7..619f252af4d1 100644
> > > > --- a/fs/nfsd/vfs.c
> > > > +++ b/fs/nfsd/vfs.c
> > > > @@ -1380,6 +1380,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct
> > > > svc_fh *fhp,
> > > > struct file *file = nf->nf_file;
> > > > unsigned int nsegs, i;
> > > > ssize_t host_err;
> > > > + size_t expected;
> > > >
> > > > nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
> > > > kiocb, *cnt, segments);
> > > > @@ -1401,11 +1402,13 @@ nfsd_direct_write(struct svc_rqst *rqstp,
> > > > struct svc_fh *fhp,
> > > > kiocb->ki_flags |= IOCB_DONTCACHE;
> > > > }
> > > >
> > > > + expected = iov_iter_count(&segments[i].iter);
> > > > +
> > > > host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> > > > if (host_err < 0)
> > > > return host_err;
> > > > *cnt += host_err;
> > > > - if (host_err < segments[i].iter.count)
> > > > + if (host_err < (ssize_t)expected)
> > > > break; /* partial write */
> > > > }
> > > >
> > > >
> > > > --
> > > > 2.54.0
> > >
> > > How many filesystems can return a short write in this case?
> > > My impression was that only the NFS client can do that.
> > >
> >
> > No idea right offhand, but NFS is exportable. Since
> > vfs_iocb_iter_write() is allowed to return a short write, I think we
> > have to deal with that properly here.
>
> NFSD_IO_DIRECT is experimental, and doesn't make sense (to me)
> to use with an NFS re-export.
>
> If we can find another filesystem that might return a short write
> with NFSD_IO_DIRECT, I might consider this a higher priority.
>

I'm fairly sure Ceph and CIFS can and they're exportable. For local
filesystems, I'm not even sure how to audit that.

Given that the potential effect is data corruption, omitting this patch
based on guesswork about what filesystems are being exported seems
unwise.
--
Jeff Layton <jlayton@xxxxxxxxxx>