Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH

From: Ojaswin Mujoo

Date: Tue Apr 21 2026 - 14:21:46 EST

On Mon, Apr 20, 2026 at 01:56:02PM +0200, Pankaj Raghav (Samsung) wrote:
> > +
> > + if (wt_ops->writethrough_submit)
> > + wt_ops->writethrough_submit(wt_ctx->inode, iomap, wt_ctx->bio_pos,
> > + len);
> > +
> > + bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, REQ_OP_WRITE, GFP_NOFS);
>
> We might want to check if bio_alloc succeeded here.

Hi Pankaj, we pass GFP_NOFS, which includes __GFP_DIRECT_RECLAIM, and
according to the comment above bio_alloc():

* If %__GFP_DIRECT_RECLAIM is set then bio_alloc will always be able to
* allocate a bio. This is due to the mempool guarantees. To make this work,
* callers must never allocate more than 1 bio at a time from the general pool.

We seem to be following this rule, so bio_alloc() should not return NULL
here.
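
To spell out the flag reasoning, here is a small userspace model (not kernel
code). It assumes the usual composition, GFP_NOFS = __GFP_RECLAIM | __GFP_IO
with __GFP_RECLAIM covering both direct and kswapd reclaim. All MOCK_* names
and bit values are placeholders made up for this sketch; only the composition
is meant to mirror the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values, chosen for this sketch only. */
#define MOCK_GFP_IO              0x1u
#define MOCK_GFP_FS              0x2u
#define MOCK_GFP_DIRECT_RECLAIM  0x4u
#define MOCK_GFP_KSWAPD_RECLAIM  0x8u

/* Assumed to mirror the kernel's composition of these masks. */
#define MOCK_GFP_RECLAIM (MOCK_GFP_DIRECT_RECLAIM | MOCK_GFP_KSWAPD_RECLAIM)
#define MOCK_GFP_NOFS    (MOCK_GFP_RECLAIM | MOCK_GFP_IO)

/* A mask without direct reclaim, for contrast (hypothetical name). */
#define MOCK_GFP_NO_DIRECT (MOCK_GFP_KSWAPD_RECLAIM)

/* NOFS drops the FS bit but keeps direct reclaim. */
static_assert((MOCK_GFP_NOFS & MOCK_GFP_FS) == 0, "NOFS must not allow FS");

/* True when the caller may sleep in direct reclaim, i.e. the case in
 * which the bio_alloc() comment promises a non-NULL result. */
static bool mock_gfp_allows_direct_reclaim(unsigned int gfp)
{
	return (gfp & MOCK_GFP_DIRECT_RECLAIM) != 0;
}
```

With a mask that lacks the direct reclaim bit the guarantee would not apply,
and a NULL check would be needed.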

>
> > + bio->bi_iter.bi_sector = iomap_sector(iomap, wt_ctx->bio_pos);
> > + bio->bi_end_io = iomap_writethrough_bio_end_io;
> > + bio->bi_private = wt_ctx;
> > +
> > + for (i = 0; i < wt_ctx->nr_bvecs; i++)
> > + __bio_add_page(bio, wt_ctx->bvec[i].bv_page,
> > + wt_ctx->bvec[i].bv_len,
> > + wt_ctx->bvec[i].bv_offset);
> > +
> > + atomic_inc(&wt_ctx->ref);
> > + submit_bio(bio);
> > + wt_ctx->nr_bvecs = 0;
> > +}
> > +
> <snip>
> > +
> > +/**
> > + * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
> > + * @wt_ctx: writethrough context
> > + * @iter: iomap iter holding mapping information
> > + * @i: iov_iter for write
> > + * @wt_ops: the fs callbacks needed for writethrough
> > + *
> > + * This function copies the user buffer to folio similar to usual buffered
> > + * IO path, with the difference that we immediately issue the IO. For this we
> > + * utilize IO submission and completion mechanism that is inspired by dio.
> > + *
> > + * Folio handling note: We might be writing through a partial folio so we need
> > + * to be careful to not clear the folio dirty bit unless there are no dirty blocks
> > + * in the folio after the writethrough.
> > + */
> > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > + struct iomap_iter *iter, struct iov_iter *i,
> > + const struct iomap_writethrough_ops *wt_ops)
> > +
> > +{
> > + ssize_t total_written = 0;
> > + int status = 0;
> > + struct address_space *mapping = iter->inode->i_mapping;
> > + size_t chunk = mapping_max_folio_size(mapping);
> > + unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> > + unsigned int bs = i_blocksize(iter->inode);
> > +
> > + /* copied over based on DIO handles these flags */
> > + if (iter->iomap.type == IOMAP_UNWRITTEN)
> > + wt_ctx->flags |= IOMAP_DIO_UNWRITTEN;
> > + if (iter->iomap.flags & IOMAP_F_SHARED)
> > + wt_ctx->flags |= IOMAP_DIO_COW;
> > +
> > + if (!(iter->flags & IOMAP_WRITETHROUGH))
> > + return -EINVAL;
> > +
> > + do {
> > + struct folio *folio;
> > + size_t offset; /* Offset into folio */
> > + u64 bytes; /* Bytes to write to folio */
> > + size_t copied; /* Bytes copied from user */
> > + u64 written; /* Bytes have been written */
> > + loff_t pos;
> > + size_t off_aligned, len_aligned;
> > +
> > + bytes = iov_iter_count(i);
> > +retry:
> > + offset = iter->pos & (chunk - 1);
> > + bytes = min(chunk - offset, bytes);
> > + status = balance_dirty_pages_ratelimited_flags(mapping,
> > + bdp_flags);
> > + if (unlikely(status))
> > + break;
> > +
> > + /*
> > + * If completions already occurred and reported errors, give up
> > + * now and don't bother submitting more bios.
> > + */
> > + if (unlikely(data_race(wt_ctx->error))) {
>
> In the unlikely scenario where we encounter an error, do we have to also
> clear the writeback flag on all the folios that is part of this
> bvec until now?
>
> Something like explicitly iterate over wt_ctx->bvec[0] through
> wt_ctx->bvec[nr_bvecs - 1], manually call folio_end_writeback(bvec[i].bv_page)
> on them, and then discard the bvecs by setting the nr_bvecs = 0;
>
> I am wondering if the folios that were processed until now will be in
> PG_WRITEBACK state which can affect reclaim as we never clear the flag.

Hey Pankaj, yes, you are right. I think the error handling is a bit buggy,
and Sashiko has also pointed out some of these issues. I'll take care of
this in v3. Thanks for pointing this out.
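
For reference, the cleanup described above can be modelled in userspace
roughly as below. This is only a sketch of the idea, not the v3 code: the
mock_* types and mock_wt_discard_pending() are hypothetical stand-ins, the
bool field stands in for the writeback flag, and mock_folio_end_writeback()
stands in for folio_end_writeback().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_BVECS 16

/* Userspace stand-ins, NOT the kernel structures. */
struct mock_folio {
	bool writeback;			/* models the writeback flag */
};

struct mock_bvec {
	struct mock_folio *folio;	/* folio backing this segment */
};

struct mock_wt_ctx {
	struct mock_bvec bvec[MOCK_MAX_BVECS];
	unsigned int nr_bvecs;		/* segments queued, not yet submitted */
	int error;			/* first error seen by a completion */
};

/* Models folio_end_writeback(): clear the flag so reclaim can proceed. */
static void mock_folio_end_writeback(struct mock_folio *folio)
{
	folio->writeback = false;
}

/*
 * Hypothetical helper: drop every segment that was queued but never
 * submitted, ending writeback on each folio before forgetting it.
 */
static void mock_wt_discard_pending(struct mock_wt_ctx *ctx)
{
	unsigned int i;

	assert(ctx->nr_bvecs <= MOCK_MAX_BVECS);

	for (i = 0; i < ctx->nr_bvecs; i++)
		mock_folio_end_writeback(ctx->bvec[i].folio);

	ctx->nr_bvecs = 0;
}
```

In the error branch this helper would take the place of the bare
"wt_ctx->nr_bvecs = 0;" so that no queued folio is left under writeback.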

Regards,
ojaswin

>
> > + wt_ctx->nr_bvecs = 0;
> > + break;
> > + }
> > +
>
> --
> Pankaj