Re: [PATCH v5 02/10] block: Add copy offload support infrastructure
From: Nitesh Shetty
Date: Tue Nov 29 2022 - 08:42:53 EST
On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > and an array of (source, destination and copy length) tuples.
> > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > bio pair with a token as payload and submit it to the device in order.
> > > > The read request populates the token with source-specific information,
> > > > which is then passed along with the write request.
> > > > This design is courtesy Mikulas Patocka's token based copy
> > >
> > > I thought this patchset was just for enabling the copy command which is
> > > supported by hardware. But it turns out it isn't, because blk_copy_offload()
> > > still submits read/write bios for doing the copy.
> > >
> > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > since the framework is already there.
> > >
> >
> > The main goal was to enable the copy command, but the community suggested
> > adding copy emulation as well.
> >
> > blk_copy_offload - actually issues the copy command in the driver layer.
> > The way read/write BIOs are perceived is different for copy offload.
> > In copy offload we check the REQ_COPY flag in the NVMe driver layer to issue
> > the copy command. But we missed adding that check in other drivers, where
> > these BIOs might be treated as normal READ/WRITE.
> >
> > blk_copy_emulate - is used if offload fails or if the device doesn't support
> > a native copy offload command. Here we do READ/WRITE. Using copy_file_range
> > for emulation might be possible, but we see 2 issues here.
> > 1. We explored the possibility of pulling dm-kcopyd into the block layer so
> > that we can readily use it. But we found it had many dependencies on the
> > dm-layer, so we later dropped that idea.
>
> Is it just because dm-kcopyd supports async copy? If yes, I believe we
> can rely on io_uring for implementing async copy_file_range, which will
> be a generic interface for async copy, and could get better perf.
>
It supports both sync and async, but it is used only inside the dm-layer.
An async version of copy_file_range can help, and io_uring can be helpful
for userspace, but in-kernel users can't use io_uring.
> > 2. copy_file_range: for block devices at least, we saw a few checks which
> > fail it for a raw block device. At this point I don't know much about the
> > history of why such checks are present.
>
> Got it, but IMO the check in generic_copy_file_checks() can be
> relaxed to cover blkdev, because splice does support blkdev.
>
> Then your bdev offload copy work can be simplified into:
>
> 1) implement .copy_file_range for def_blk_fops, suppose it is
> blkdev_copy_file_range()
>
> 2) inside blkdev_copy_file_range()
>
> - if the bdev supports offload copy, just submit one bio to the device,
> and this will be converted to one pt req to device
>
> - otherwise, fall back to generic_copy_file_range()
>
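A rough userspace model of the dispatch proposed in 2), only to illustrate
the control flow. Everything here (struct mock_bdev, the mock_* helpers,
last_path) is hypothetical and is not kernel code. Requiring both devices to
support offload is an assumption, as is the bounds check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a block device; not a kernel structure. */
struct mock_bdev {
	unsigned char *data;        /* backing storage */
	size_t size;                /* capacity in bytes */
	bool supports_copy_offload; /* models a device with a native copy command */
};

/* Which path served the last copy; exists only so the sketch can be checked. */
enum copy_path { COPY_NONE, COPY_OFFLOAD, COPY_GENERIC };
static enum copy_path last_path = COPY_NONE;

/* Models "submit one bio to the device": a single device-side copy. */
static long mock_offload_copy(struct mock_bdev *src, size_t src_off,
			      struct mock_bdev *dst, size_t dst_off, size_t len)
{
	memmove(dst->data + dst_off, src->data + src_off, len);
	last_path = COPY_OFFLOAD;
	return (long)len;
}

/* Models the generic fallback: read into a bounce buffer, then write. */
static long mock_generic_copy(struct mock_bdev *src, size_t src_off,
			      struct mock_bdev *dst, size_t dst_off, size_t len)
{
	unsigned char buf[512];
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done < sizeof(buf) ? len - done : sizeof(buf);

		memcpy(buf, src->data + src_off + done, chunk);
		memcpy(dst->data + dst_off + done, buf, chunk);
		done += chunk;
	}
	last_path = COPY_GENERIC;
	return (long)len;
}

/* Models the proposed blkdev_copy_file_range(): offload if possible, else fall back. */
long mock_blkdev_copy_file_range(struct mock_bdev *src, size_t src_off,
				 struct mock_bdev *dst, size_t dst_off, size_t len)
{
	/* Reject out-of-range requests; a real implementation would return an errno. */
	if (src_off > src->size || len > src->size - src_off ||
	    dst_off > dst->size || len > dst->size - dst_off)
		return -1;

	if (src->supports_copy_offload && dst->supports_copy_offload)
		return mock_offload_copy(src, src_off, dst, dst_off, len);

	return mock_generic_copy(src, src_off, dst, dst_off, len);
}
```

The point of the model is that callers see one entry point and the
offload/emulation decision stays hidden behind it.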
We will check the feasibility and try to implement this scheme in the next
version. It would be helpful if someone in the community knows why such
checks are present. We see that copy_file_range accepts only regular files.
Was it designed only for regular files, or can we extend it to regular block
devices?
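
As a data point on current behaviour: the copy_file_range(2) man page lists
EINVAL when either fd is not a regular file. A rough userspace sketch of how
a caller copes with that today, trying copy_file_range() first and falling
back to pread()/pwrite(), could look like this. The helper name
copy_range_with_fallback and the set of errno values treated as "fall back"
are assumptions for illustration only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes from fd_in@off_in to fd_out@off_out.
 * Returns the number of bytes copied (short on source EOF), or -1 on error.
 */
ssize_t copy_range_with_fallback(int fd_in, off_t off_in,
				 int fd_out, off_t off_out, size_t len)
{
	size_t done = 0;

	while (done < len) {
		off64_t in = off_in + (off_t)done;
		off64_t out = off_out + (off_t)done;
		ssize_t n = copy_file_range(fd_in, &in, fd_out, &out,
					    len - done, 0);

		/* Assumed set of "not supported here" errors; fall back to read/write. */
		if (n < 0 && (errno == EINVAL || errno == EXDEV ||
			      errno == ENOSYS || errno == EOPNOTSUPP)) {
			char buf[4096];
			size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
			off_t rd = off_in + (off_t)done;
			off_t wr = off_out + (off_t)done;

			n = pread(fd_in, buf, want, rd);
			/* Write out everything that was read, handling short writes. */
			for (ssize_t w = 0; n > 0 && w < n; ) {
				ssize_t r = pwrite(fd_out, buf + w, (size_t)(n - w), wr + w);

				if (r < 0)
					return -1;
				w += r;
			}
		}
		if (n < 0)
			return -1;
		if (n == 0)
			break; /* EOF on the source */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

If the check were relaxed for block devices, the fallback branch would simply
stop being taken for them, with no change needed in callers like this one.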
> >
> > > When I was researching pipe/splice code for supporting ublk zero copy[1], I
> > > got some ideas for async copy_file_range(), such as: io_uring based
> > > direct splice, user backed intermediate buffer, still zero copy. If these
> > > ideas are finally implemented, we could get super-fast generic offload copy,
> > > and bdev copy would be covered too.
> > >
> > > [1] https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming.lei@xxxxxxxxxx/
> > >
> >
> > Seems interesting, we will take a look at this.
>
> BTW, that is probably one direction of ublk's async zero copy IO too.
>
>
> Thanks,
> Ming
>
>
Thanks,
Nitesh