Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

From: Luis Henriques
Date: Tue Feb 16 2021 - 07:05:20 EST


"gregkh@xxxxxxxxxxxxxxxxxxx" <gregkh@xxxxxxxxxxxxxxxxxxx> writes:

> On Tue, Feb 16, 2021 at 11:17:34AM +0000, Luis Henriques wrote:
>> Amir Goldstein <amir73il@xxxxxxxxx> writes:
>>
>> > On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>> >>
>> >> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
>> >> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
>> >> > trondmy@xxxxxxxxxxxxxxx> wrote:
>> >> > >
>> >> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
>> >> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
>> >> > > > lhenriques@xxxxxxx>
>> >> > > > wrote:
>> >> > > > >
>> >> > > > > Nicolas Boichat reported an issue when trying to use the
>> >> > > > > copy_file_range
>> >> > > > > syscall on a tracefs file. It failed silently because the file
>> >> > > > > content is
>> >> > > > > generated on-the-fly (reporting a size of zero) and
>> >> > > > > copy_file_range
>> >> > > > > needs
>> >> > > > > to know in advance how much data is present.
>> >> > > > >
>> >> > > > > This commit restores the cross-fs restrictions that existed
>> >> > > > > prior
>> >> > > > > to
>> >> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> >> > > > > devices")
>> >> > > > > and
>> >> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
>> >> > > > > and
>> >> > > > > nfs.
>> >> > > > >
>> >> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> >> > > > > devices")
>> >> > > > > Link:
>> >> > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@xxxxxxxxxxxx/
>> >> > > > > Cc: Nicolas Boichat <drinkcat@xxxxxxxxxxxx>
>> >> > > > > Signed-off-by: Luis Henriques <lhenriques@xxxxxxx>
>> >> > > >
>> >> > > > Code looks ok.
>> >> > > > You may add:
>> >> > > >
>> >> > > > Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>
>> >> > > >
>> >> > > > I agree with Trond that the first paragraph of the commit message
>> >> > > > could
>> >> > > > be improved.
>> >> > > > The purpose of this change is to fix the change of behavior that
>> >> > > > caused the regression.
>> >> > > >
>> >> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to
>> >> > > > read.
>> >> > > > After v5.3, behavior is zero size copy.
>> >> > > >
>> >> > > > It does not matter so much what makes sense for CFR to do in this
>> >> > > > case (generic cross-fs copy). What matters is that nobody asked
>> >> > > > for
>> >> > > > this change and that it caused problems.
>> >> > > >
>> >> > >
>> >> > > No. I'm saying that this patch should be NACKed unless there is a
>> >> > > real
>> >> > > explanation for why we give crap about this tracefs corner case and
>> >> > > why
>> >> > > it can't be fixed.
>> >> > >
>> >> > > There are plenty of reasons why copy offload across filesystems
>> >> > > makes
>> >> > > sense, and particularly when you're doing NAS. Clone just doesn't
>> >> > > cut
>> >> > > it when it comes to disaster recovery (whereas backup to a
>> >> > > different
>> >> > > storage unit does). If the client has to do the copy, then you're
>> >> > > effectively doubling the load on the server, and you're adding
>> >> > > potentially unnecessary network traffic (or at the very least you
>> >> > > are
>> >> > > doubling that traffic).
>> >> > >
>> >> >
>> >> > I don't understand the use case you are describing.
>> >> >
>> >> > Which filesystem types are you talking about for source and target
>> >> > of copy_file_range()?
>> >> >
>> >> > To be clear, the original change was done to support NFS/CIFS server-
>> >> > side
>> >> > copy and those should not be affected by this change.
>> >> >
>> >>
>> >> That is incorrect:
>> >>
>> >> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file
>> >> *dst,
>> >> u64 dst_pos, u64 count)
>> >> {
>> >>
>> >> /*
>> >> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>> >> * thread and client rpc slot. The choice of 4MB is somewhat
>> >> * arbitrary. We might instead base this on r/wsize, or make it
>> >> * tunable, or use a time instead of a byte limit, or implement
>> >> * asynchronous copy. In theory a client could also recognize a
>> >> * limit like this and pipeline multiple COPY requests.
>> >> */
>> >> count = min_t(u64, count, 1 << 22);
>> >> return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>> >> }
>> >>
>> >> You are now explicitly changing the behaviour of knfsd when the source
>> >> and destination filesystem differ.
>> >>
>> >> For one thing, you are disallowing the NFSv4.2 copy offload use case of
>> >> copying from a local filesystem to a remote NFS server. However you are
>> >> also disallowing the copy from, say, an XFS formatted partition to an
>> >> ext4 partition.
>> >>
>> >
>> > Got it.
>>
>> Ugh. And I guess overlayfs may have a similar problem.
>>
>> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>> > is internal to kernel users.
>> >
>> > FWIW, you may want to look at the loop in ovl_copy_up_data()
>> > for improvements to nfsd_copy_file_range().
>> >
>> > We can move the check out to copy_file_range syscall:
>> >
>> > if (flags != 0)
>> > return -EINVAL;
>> >
>> > Leave the fallback from all filesystems and check for the
>> > COPY_FILE_SPLICE flag inside generic_copy_file_range().
>>
>> Ok, the diff bellow is just to make sure I understood your suggestion.
>>
>> The patch will also need to:
>>
>> - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>> use the new flag.
>>
>> - check flags in generic_copy_file_checks() to make sure only valid flags
>> are used (COPY_FILE_SPLICE at the moment).
>>
>> Also, where should this flag be defined? include/uapi/linux/fs.h?
>
> Why would userspace want/need this flag?

In fact, my question sort of implied yours :-)

What I wanted to know was whether we would like to allow userspace to
_explicitly_ revert to the current behaviour (i.e. use the flag to allow
cross-fs copies) or to continue to return -EINVAL to userspace if flags
are != 0 (in which case this check would need to move to the syscall
definition).

Cheers,
--
Luis