Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
From: Jan Kara
Date: Thu Apr 16 2026 - 08:13:52 EST
On Thu 09-04-26 00:15:43, Ojaswin Mujoo wrote:
> This adds initial support for performing buffered non-aio
> RWF_WRITETHROUGH write. The rough flow for a writethrough write is as
> follows:
>
> 1. Acquire inode lock
> 2. initialize writethrough context (wt_ctx) and mark
> mapping as stable.
> 3. Start the iomap_iter() loop. For each iomap:
> 3.1. Acquire folio and folio_lock.
> 3.2. perform memcpy from user buffer to the folio and mark it
> dirty
> 3.3. Wait for any current writeback to complete and then call
> folio_mkclean() to prevent mmap writes from changing it.
> 3.4. Start writeback on the folio
> 3.5. Add the folio range under write to wt_ctx->bvec and folio_unlock()
> 3.6. If bvec is full, submit the current bvecs for IO.
> 3.7. Repeat 3.2 to 3.6 till the whole iomap is processed. Submit
> the final set of bvecs for IO.
> 4. Repeat step 3 till we have no more data to write.
> 5. Finally, sleep in the syscall thread till all the IOs are
> completed (refcount == 0). Once that happens, the end io handler will
> wake us up.
> 6. Upon waking up, call fs ->end_io() callback (which updates inode
> size), record any errors and return.
> 7. inode_unlock()
>
> This design gives buffered writethrough the same semantics as dio and
> any error in the IO is directly returned to the caller. The design has
> deliberately open coded the IO submission and completion flow (inspired
> by dio) rather than reusing the dio functions as accommodating buffered
> writethrough logic in dio code was polluting it with too many if else
> conditionals and special cases.
>
> Suggested-by: Jan Kara <jack@xxxxxxx>
> Suggested-by: Dave Chinner <dgc@xxxxxxxxxx>
> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
Overall this looks good to me. Just a few smaller things below:
> +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> + struct iomap_iter *iter, struct iov_iter *i,
> + const struct iomap_writethrough_ops *wt_ops)
> +
> +{
> + ssize_t total_written = 0;
> + int status = 0;
> + struct address_space *mapping = iter->inode->i_mapping;
> + size_t chunk = mapping_max_folio_size(mapping);
> + unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> + unsigned int bs = i_blocksize(iter->inode);
> +
> + /* copied over based on DIO handles these flags */
^ missing 'how' here: "based on how DIO handles these flags"
> +ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
> + const struct iomap_writethrough_ops *wt_ops,
> + void *private)
> +{
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> + struct iomap_iter iter = {
> + .inode = inode,
> + .pos = iocb->ki_pos,
> + .len = iov_iter_count(i),
> + .flags = IOMAP_WRITE | IOMAP_WRITETHROUGH,
> + .private = private,
> + };
> + struct iomap_writethrough_ctx *wt_ctx;
> + unsigned int max_bvecs;
> + ssize_t ret;
> +
> +
> + /*
> + * For now we don't support any other flag with WRITETHROUGH
> + */
> + if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
> + return -EINVAL;
> + if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
> + return -EINVAL;
> + if (iocb_is_dsync(iocb))
> + /* D_SYNC support not implemented yet */
> + return -EOPNOTSUPP;
> + if (!is_sync_kiocb(iocb))
> + /* aio support not implemented yet */
> + return -EOPNOTSUPP;
> +
> + /*
> + * +1 to max bvecs to account for unaligned write spanning multiple
> + * folios
> + */
> + max_bvecs = DIV_ROUND_UP(
> + iov_iter_count(i),
> + PAGE_SIZE << mapping_min_folio_order(inode->i_mapping)) + 1;
Can this overflow? iov_iter_count() returns size_t, which is unsigned
long, while max_bvecs is only unsigned int.
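To illustrate the concern, here is a minimal userspace sketch, not kernel
code. wt_max_bvecs_naive() and wt_max_bvecs() are hypothetical names, and
BIO_MAX_VECS is redefined locally just so this builds standalone. The naive
variant narrows the size_t result to unsigned int before clamping, as the
hunk above does, so a large enough count could wrap. The second variant
clamps while still in size_t. I have not checked whether the count can
actually get that large on this path.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy for the userspace model; the kernel defines this itself. */
#define BIO_MAX_VECS 256U

/* Mirrors the posted computation: narrow to unsigned int, then clamp. */
static unsigned int wt_max_bvecs_naive(size_t count, size_t folio_size)
{
	unsigned int n;

	assert(folio_size != 0);
	/* The narrowing cast may truncate when size_t is wider than int. */
	n = (unsigned int)((count + folio_size - 1) / folio_size + 1);
	if (n > BIO_MAX_VECS)
		n = BIO_MAX_VECS;
	return n;
}

/* Hypothetical alternative: clamp in size_t, narrow only afterwards. */
static unsigned int wt_max_bvecs(size_t count, size_t folio_size)
{
	size_t nr;

	assert(folio_size != 0);
	nr = count / folio_size;
	if (nr >= BIO_MAX_VECS)
		return BIO_MAX_VECS;
	/* Round up for a partial folio, +1 for an unaligned start. */
	nr += (count % folio_size != 0) + 1;
	return nr > BIO_MAX_VECS ? BIO_MAX_VECS : (unsigned int)nr;
}
```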
> +
> + if (max_bvecs > BIO_MAX_VECS)
> + max_bvecs = BIO_MAX_VECS;
> + if (!max_bvecs)
> + max_bvecs = 1;
I don't think 0 is possible here since we do +1 in max_bvecs computation
above.
> +
> + wt_ctx = kzalloc(struct_size(wt_ctx, bvec, max_bvecs), GFP_NOFS);
> + if (!wt_ctx)
> + return -ENOMEM;
> +
> + wt_ctx->iocb = iocb;
> + wt_ctx->inode = inode;
> + wt_ctx->dops = wt_ops->dops;
> + wt_ctx->pos = iocb->ki_pos;
> + wt_ctx->new_i_size = i_size_read(inode);
> + wt_ctx->max_bvecs = max_bvecs;
> + atomic_set(&wt_ctx->ref, 1);
> + wt_ctx->waiter = current;
> +
> + mapping_set_stable_writes(inode->i_mapping);
> +
We should check whether the mapping is already marked as requiring stable
pages, and avoid messing with (in particular clearing) the flag in that case.
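Roughly what I have in mind, as a userspace model rather than a patch. The
struct address_space and mapping_*stable_writes() definitions below are
stand-ins so the sketch builds on its own, and struct wt_ctx_model with
its set_stable field, wt_begin() and wt_end() are hypothetical names that
are not in the posted series. The idea is just to remember whether we were
the ones who set the flag and to clear it only in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel type and helpers, for this model only. */
struct address_space { unsigned long flags; };
#define AS_STABLE_WRITES_BIT 0	/* bit position is arbitrary here */

static bool mapping_stable_writes(const struct address_space *m)
{
	return m->flags & (1UL << AS_STABLE_WRITES_BIT);
}

static void mapping_set_stable_writes(struct address_space *m)
{
	m->flags |= 1UL << AS_STABLE_WRITES_BIT;
}

static void mapping_clear_stable_writes(struct address_space *m)
{
	m->flags &= ~(1UL << AS_STABLE_WRITES_BIT);
}

/* Hypothetical context: records whether this write set the flag. */
struct wt_ctx_model {
	struct address_space *mapping;
	bool set_stable;
};

static void wt_begin(struct wt_ctx_model *ctx, struct address_space *mapping)
{
	ctx->mapping = mapping;
	/* Only take ownership of the flag if nobody had set it before. */
	ctx->set_stable = !mapping_stable_writes(mapping);
	if (ctx->set_stable)
		mapping_set_stable_writes(mapping);
}

static void wt_end(struct wt_ctx_model *ctx)
{
	assert(ctx->mapping != NULL);
	/* Leave a pre-existing stable-writes requirement untouched. */
	if (ctx->set_stable)
		mapping_clear_stable_writes(ctx->mapping);
}
```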
> + while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
> + WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
> + iter.iomap.type != IOMAP_MAPPED);
> + iter.status = iomap_writethrough_iter(wt_ctx, &iter, i, wt_ops);
> + }
> + if (ret < 0)
> + cmpxchg(&wt_ctx->error, 0, ret);
> +
> + if (!atomic_dec_and_test(&wt_ctx->ref)) {
> + for (;;) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + if (!READ_ONCE(wt_ctx->waiter))
> + break;
> + blk_io_schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> + }
> +
> + return iomap_writethrough_complete(wt_ctx);
> +}
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR