Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

From: Jeff Layton
Date: Sat Apr 11 2015 - 09:04:17 EST

On Fri, 10 Apr 2015 20:24:06 -0400
Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:

> On Fri, Apr 10, 2015 at 8:02 PM, Zach Brown <zab@xxxxxxxxxx> wrote:
> > On Fri, Apr 10, 2015 at 06:36:41PM -0400, Trond Myklebust wrote:
> >> On Fri, Apr 10, 2015 at 6:00 PM, Zach Brown <zab@xxxxxxxxxx> wrote:
> >
> >> > +
> >> > +/*
> >> > + * copy_file_range() differs from regular file read and write in that it
> >> > + * specifically allows return partial success. When it does so is up to
> >> > + * the copy_file_range method.
> >> > + */
> >> > +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >> > + struct file *file_out, loff_t pos_out,
> >> > + size_t len, int flags)
> >>
> >> I'm going to repeat a gripe with this interface. I really don't think
> >> we should treat copy_file_range() as taking a size_t length, since
> >> that is not sufficient to do a full file copy on 32-bit systems w/ LFS
> >> support.
> >
> > *nod*. The length type is limited by the syscall return type and the
> > arbitrary desire to mimic read/write.
> >
> > I sympathize with wanting to copy giant files with operations that don't
> > scale with file size because files can be enormous but sparse.
> The other argument against using a size_t is that there is no memory
> buffer involved here. size_t is, after all, a type describing
> in-memory objects, not files.
> >> Could we perhaps instead of a length, define a 'pos_in_start' and a
> >> 'pos_in_end' offset (with the latter being -1 for a full-file copy)
> >> and then return an 'loff_t' value stating where the copy ended?
> >
> > Well, the resulting offset will be set if the caller provided it. So
> > they could already be getting the copied length from that. But they
> > might not specify the offsets. Maybe they're just using the results to
> > total up a completion indicator.
> >
> > Maybe we could make the length a pointer like the offsets that's set to
> > the copied length on return.
> That works, but why do we care so much about the difference between a
> length and an offset as a return value?

I think it just comes down to potential confusion for users. What's
more useful, the number of bytes actually copied, or the offset into the
file where the copy ended?

I tend to the think an offset is more useful for someone trying to
copy a file in chunks, particularly if the file is sparse. That gives
them a clear place to continue the copy.

So, I think I agree with Trond that phrasing this interface in terms of
file offsets seems like it might be more useful. That also neatly
sidesteps the size_t limitations on 32-bit platforms.

> To be fair, the NFS copy offload also allows the copy to proceed out
> of order, in which case the range of copied data could be
> non-contiguous in the case of a failure. However neither the length
> nor the offset case will give you the full story in that case. Any
> return value can at best be considered to define an offset range whose
> contents need to be checked for success/failure.

Yuck! How the heck do you clean up the mess if that happens? I guess
you're just stuck redoing the copy with normal READ/WRITE?

Maybe we need to have the interface return a hard error in that
case and not try to give back any sort of offset?

> > This all seems pretty gross. Does anyone else have a vote?
> >
> > (And I'll argue strongly against creating magical offset values that
> > change behaviour. If we want to ignore arguments and get the length
> > from the source file we'd add a flag to do so.)
> The '-1' was not intended to be a special/magical value: as far as I'm
> concerned any end offset that covers the full range of supported file
> lengths would be OK.

Agreed. A "whole file" flag might also be useful too, but I'd leave
that for after the initial implementation is merged, just in the
interest of having _something_ that works in the near term.

> >> Note that both btrfs and NFSv4.2 allow for 64-bit lengths, so this
> >> interface would be closer to what is already in use anyway.
> >
> > Yeah, btrfs doesn't allow partial progress. It returns 0 on success.
> > We could also do that but people have expressed an interest in returning
> > partial progress.
> Returning an end offset would satisfy the partial progress requirement
> (with the caveat mentioned above).

Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at