Re: [PATCH v4] iomap: add simple read path for small direct I/O

From: Fengnan

Date: Wed Jun 24 2026 - 22:25:12 EST

在 2026/6/24 23:37, Christoph Hellwig 写道:
> Sorry for the delay in getting back to this, I'm a bit overloaded at
> the moment.
>
>> -static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
>> +static inline bool should_report_dio_fserror(int error)
> Can you split all the refactoring into prep patches?
off course

>
>> +/*
>> + * In the async simple read path, we need to prevent bio_endio() from
>> + * triggering iocb->ki_complete() before the submitter has returned
>> + * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
>> + *
>> + * We use a three-state rendezvous to synchronize the submitter and end_io:
>> + *
>> + * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
>> + *
>> + * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
>> + * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
>> + * ki_complete().
>> + *
>> + * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
>> + * submit path. end_io sets this state and does nothing else. The submitter
>> + * will see this state and handle the completion synchronously (bypassing
>> + * ki_complete() and returning the actual result).
>> + */
> I don't think we actually need any of this. For the sync case we
> can just use submit_bio_wait, and for async just always complete
> from the end_io handler. This will simplify the implementation a lot,
> and also avoid the atomic.

I was wrong before, in simple read path, we won't use sr after submit bio.
>
>> +static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
> Btw, I'd drop the _read in the name. Most of this would work as-is
> for trivial overwrites if we figure out when to use them.
ok, let's rename to iomap_dio_simple_xxx

>
>> + if (dio_flags & IOMAP_DIO_BOUNCE)
>> + nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
>> + else
>> + nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
> Bounce buffering requires dops, so all this can be dropped.
Get .
>
>> + ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
>> + &iomi.iomap, &iomi.srcmap);
>> + if (ret) {
>> + inode_dio_end(inode);
>> + return ret;
>> + }
>> +
>> + if (iomi.iomap.type != IOMAP_MAPPED ||
>> + iomi.iomap.offset > iomi.pos ||
> I don't think offset > pos can happen

>
>> + iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
>> + (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
>> + ret = -ENOTBLK;
>> + goto out_iomap_end;
>> + }
> Given that we already have a fallback here, I'm not sure why this is
> limited to a single file system block. Anything that:
>
> a) fits into the iomap
> b) fits into a single bio
>
> can be easily supported. The first condition is a trivial, and for
> the second we could just check if iter->nr_segs is larger than
> BIO_MAX_VECS.
The reason I only added a simple path for 4K reads is that, in current
NVMe, 4K random reads suffer from a significant bottleneck, whereas
8K reads and 4K writes do not.
Considering that if an 8K or larger block size does not fit within a single
BIO, two iomap_begin/iomap_end calls would be required, resulting in
additional overhead.
Of course, even without considering this scenario, there would still be
some benefit, it’s just not as significant.
How do you approach this issue?

>
>> + if (user_backed_iter(iter))
>> + dio_flags |= IOMAP_DIO_USER_BACKED;
>> + bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
>> + bio->bi_ioprio = iocb->ki_ioprio;
>> + bio->bi_private = sr;
>> + bio->bi_end_io = iomap_dio_simple_read_end_io;
>> +
>> + if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
>> + !(dio_flags & IOMAP_DIO_BOUNCE))
>> + bio_set_pages_dirty(bio);
>> +
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + bio->bi_opf |= REQ_NOWAIT;
>> + if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
>> + bio->bi_opf |= REQ_POLLED;
>> + bio_set_polled(bio, iocb);
>> + WRITE_ONCE(iocb->private, bio);
>> + }
> Can you check if sone more of this can be factored into a shared
> helper?
I'll try.

>
> Below is a completely untested patch implementing my suggestion
> for the completion simplification. It compiles, but that's about
> the guarantees I can give for it:
I'll apply and do some test.

>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 3cb179752612..c785512e5339 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -909,11 +909,7 @@ struct iomap_dio_simple_read {
> struct kiocb *iocb;
> size_t size;
> unsigned int dio_flags;
> - atomic_t state;
> - union {
> - struct task_struct *waiter;
> - struct work_struct work;
> - };
> + struct work_struct work;
> /*
> * Align @bio to a cacheline boundary so that, combined with the
> * front_pad passed to bioset_init(), the bio sits at the start of
> @@ -926,35 +922,12 @@ struct iomap_dio_simple_read {
>
> static struct bio_set iomap_dio_simple_read_pool;
>
> -/*
> - * In the async simple read path, we need to prevent bio_endio() from
> - * triggering iocb->ki_complete() before the submitter has returned
> - * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
> - *
> - * We use a three-state rendezvous to synchronize the submitter and end_io:
> - *
> - * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
> - *
> - * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
> - * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
> - * ki_complete().
> - *
> - * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
> - * submit path. end_io sets this state and does nothing else. The submitter
> - * will see this state and handle the completion synchronously (bypassing
> - * ki_complete() and returning the actual result).
> - */
> -enum {
> - IOMAP_DIO_SIMPLE_SUBMITTING = 0,
> - IOMAP_DIO_SIMPLE_QUEUED,
> - IOMAP_DIO_SIMPLE_DONE,
> -};
> -
> -static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
> - struct bio *bio, ssize_t ret)
> +static ssize_t iomap_dio_simple_read_complete(struct iomap_dio_simple_read *sr)
> {
> + struct bio *bio = &sr->bio;
> + struct kiocb *iocb = sr->iocb;
> struct inode *inode = file_inode(iocb->ki_filp);
> - struct iomap_dio_simple_read *sr = bio->bi_private;
> + ssize_t ret = blk_status_to_errno(bio->bi_status);
>
> if (likely(!ret)) {
> ret = sr->size;
> @@ -965,21 +938,6 @@ static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
> }
>
> iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
> -
> - return ret;
> -}
> -
> -static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
> - struct bio *bio)
> -{
> - struct inode *inode = file_inode(iocb->ki_filp);
> - ssize_t ret;
> -
> - WRITE_ONCE(iocb->private, NULL);
> -
> - ret = iomap_dio_simple_read_finish(iocb, bio,
> - blk_status_to_errno(bio->bi_status));
> -
> inode_dio_end(inode);
> trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> return ret;
> @@ -988,45 +946,26 @@ static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
> static void iomap_dio_simple_read_complete_work(struct work_struct *work)
> {
> struct iomap_dio_simple_read *sr =
> - container_of(work, struct iomap_dio_simple_read, work);
> - struct kiocb *iocb = sr->iocb;
> - ssize_t ret;
> + container_of(work, struct iomap_dio_simple_read, work);
>
> - ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
> - iocb->ki_complete(iocb, ret);
> + WRITE_ONCE(sr->iocb->private, NULL);
> + sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
> }
>
> -static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
> +static void iomap_dio_simple_read_end_io(struct bio *bio)
> {
> - struct kiocb *iocb = sr->iocb;
> + struct iomap_dio_simple_read *sr =
> + container_of(bio, struct iomap_dio_simple_read, bio);
>
> if (unlikely(sr->bio.bi_status)) {
> - struct inode *inode = file_inode(iocb->ki_filp);
> + struct inode *inode = file_inode(sr->iocb->ki_filp);
>
> INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
> queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
> return;
> }
>
> - iomap_dio_simple_read_complete_work(&sr->work);
> -}
> -
> -static void iomap_dio_simple_read_end_io(struct bio *bio)
> -{
> - struct iomap_dio_simple_read *sr = bio->bi_private;
> -
> - if (sr->waiter) {
> - struct task_struct *waiter = sr->waiter;
> -
> - WRITE_ONCE(sr->waiter, NULL);
> - blk_wake_io_task(waiter);
> - return;
> - }
> -
> - if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
> - atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
> - IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
> - iomap_dio_simple_read_async_done(sr);
> + sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
> }
>
> static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
> @@ -1046,11 +985,13 @@ static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
> */
> if (count > inode->i_sb->s_blocksize)
> return false;
> - if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
> + if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL |
> + IOMAP_DIO_BOUNCE))
> return false;
> if (iocb->ki_pos + count > i_size_read(inode))
> return false;
>
> + // XXX: reject fscrypt
> return true;
> }
>
> @@ -1060,7 +1001,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> size_t count = iov_iter_count(iter);
> - int nr_pages;
> struct iomap_dio_simple_read *sr;
> unsigned int alignment;
> struct iomap_iter iomi = {
> @@ -1074,11 +1014,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
> bool wait_for_completion = is_sync_kiocb(iocb);
> ssize_t ret;
>
> - if (dio_flags & IOMAP_DIO_BOUNCE)
> - nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
> - else
> - nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
> -
> if (iocb->ki_flags & IOCB_NOWAIT)
> iomi.flags |= IOMAP_NOWAIT;
>
> @@ -1120,24 +1055,18 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
> if (user_backed_iter(iter))
> dio_flags |= IOMAP_DIO_USER_BACKED;
>
> - bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
> - REQ_OP_READ | REQ_SYNC | REQ_IDLE,
> - GFP_KERNEL, &iomap_dio_simple_read_pool);
> + bio = bio_alloc_bioset(iomi.iomap.bdev,
> + bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
> + REQ_OP_READ | REQ_SYNC | REQ_IDLE,
> + GFP_KERNEL, &iomap_dio_simple_read_pool);
> sr = container_of(bio, struct iomap_dio_simple_read, bio);
> -
> - fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
> sr->iocb = iocb;
> sr->dio_flags = dio_flags;
>
> bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
> bio->bi_ioprio = iocb->ki_ioprio;
> - bio->bi_private = sr;
> - bio->bi_end_io = iomap_dio_simple_read_end_io;
>
> - if (dio_flags & IOMAP_DIO_BOUNCE)
> - ret = bio_iov_iter_bounce(bio, iter, count);
> - else
> - ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
> + ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
> if (unlikely(ret))
> goto out_bio_put;
>
> @@ -1161,49 +1090,22 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
> WRITE_ONCE(iocb->private, bio);
> }
>
> - if (wait_for_completion) {
> - sr->waiter = current;
> - blk_crypto_submit_bio(bio);
> - } else {
> - atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
> - sr->waiter = NULL;
> - blk_crypto_submit_bio(bio);
> - ret = -EIOCBQUEUED;
> - }
> -
> if (ops->iomap_end)
> ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
> &iomi.iomap);
>
> - if (wait_for_completion) {
> - for (;;) {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> - if (!READ_ONCE(sr->waiter))
> - break;
> - blk_io_schedule();
> - }
> - __set_current_state(TASK_RUNNING);
> -
> - ret = iomap_dio_simple_read_finish(iocb, bio,
> - blk_status_to_errno(bio->bi_status));
> - inode_dio_end(inode);
> - trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
> - ret > 0 ? ret : 0);
> - } else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
> - IOMAP_DIO_SIMPLE_QUEUED) ==
> - IOMAP_DIO_SIMPLE_DONE) {
> - ret = iomap_dio_simple_read_complete(iocb, bio);
> - } else {
> + if (!wait_for_completion) {
> + bio->bi_end_io = iomap_dio_simple_read_end_io;
> + submit_bio(bio);
> trace_iomap_dio_rw_queued(inode, iomi.pos, count);
> + return -EIOCBQUEUED;
> }
>
> - return ret;
> + submit_bio_wait(bio);
> + return iomap_dio_simple_read_complete(sr);
>
> out_bio_release_pages:
> - if (dio_flags & IOMAP_DIO_BOUNCE)
> - bio_iov_iter_unbounce(bio, true, false);
> - else
> - bio_release_pages(bio, false);
> + bio_release_pages(bio, false);
> out_bio_put:
> bio_put(bio);
> out_iomap_end: