Re: [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling

From: Kanchan Joshi
Date: Thu Jun 18 2020 - 14:37:38 EST


On Thu, Jun 18, 2020 at 07:16:19AM +0000, Damien Le Moal wrote:
On 2020/06/18 2:27, Kanchan Joshi wrote:
From: Selvakumar S <selvakuma.s1@xxxxxxxxxxx>

Introduce IOCB_ZONE_APPEND flag, which is set in kiocb->ki_flags for
zone-append. Direct I/O submission path uses this flag to send bio with
append op. And completion path uses the same to return zone-relative
offset to upper layer.

Signed-off-by: SelvaKumar S <selvakuma.s1@xxxxxxxxxxx>
Signed-off-by: Kanchan Joshi <joshi.k@xxxxxxxxxxx>
Signed-off-by: Nitesh Shetty <nj.shetty@xxxxxxxxxxx>
Signed-off-by: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
---
fs/block_dev.c | 19 ++++++++++++++++++-
include/linux/fs.h | 1 +
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e5..4c84b4d0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
/* avoid the need for a I/O completion work item */
if (iocb->ki_flags & IOCB_DSYNC)
op |= REQ_FUA;
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (iocb->ki_flags & IOCB_ZONE_APPEND)
+ op |= REQ_OP_ZONE_APPEND | REQ_NOMERGE;
+#endif

No need for the #ifdef. And no need for the REQ_NOMERGE either since
REQ_OP_ZONE_APPEND requests are defined as not mergeable already.

return op;
}

@@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
}

+#ifdef CONFIG_BLK_DEV_ZONED
+static inline long blkdev_bio_end_io_append(struct bio *bio)
+{
+ return (bio->bi_iter.bi_sector %
+ blk_queue_zone_sectors(bio->bi_disk->queue)) << SECTOR_SHIFT;

A zone size is at most 4G sectors as defined by the queue chunk_sectors limit
(unsigned int). It means that the return value here can overflow due to the
shift, at least on 32bits arch.

And as Pavel already commented, zone sizes are power of 2 so you can use
bitmasks instead of costly divisions.

+}
+#endif
+
static void blkdev_bio_end_io(struct bio *bio)
{
struct blkdev_dio *dio = bio->bi_private;
@@ -307,15 +319,20 @@ static void blkdev_bio_end_io(struct bio *bio)
if (!dio->is_sync) {
struct kiocb *iocb = dio->iocb;
ssize_t ret;
+ long res = 0;

if (likely(!dio->bio.bi_status)) {
ret = dio->size;
iocb->ki_pos += ret;
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (iocb->ki_flags & IOCB_ZONE_APPEND)
+ res = blkdev_bio_end_io_append(bio);

overflow... And no need for the #ifdef.

+#endif
} else {
ret = blk_status_to_errno(dio->bio.bi_status);
}

- dio->iocb->ki_complete(iocb, ret, 0);
+ dio->iocb->ki_complete(iocb, ret, res);

ki_complete interface also needs to be changed to have a 64bit last argument to
avoid overflow.

And this all does not work anyway because __blkdev_direct_IO() and
__blkdev_direct_IO_simple() both call bio_iov_iter_get_pages() *before*
dio_bio_write_op() is called. This means that bio_iov_iter_get_pages() will not
see that it needs to get the pages for a zone append command and so will not
call __bio_iov_append_get_pages(). That leads to BIO split and potential
corruption of the user data as fragments of the kiocb may get reordered.

There is a lot more to do to the block_dev direct IO code for this to work.

Thanks a lot for the great review. Very helpful. We'll fix in V2.