Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
From: Ojaswin Mujoo
Date: Tue May 26 2026 - 13:13:44 EST
On Wed, May 20, 2026 at 10:49:50AM +0800, Zhang Yi wrote:
> On 5/20/2026 12:53 AM, Ojaswin Mujoo wrote:
> > On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
> > > On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> > > > On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> > > > > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > > >
> > > > > Introduce initial support for iomap in the buffered I/O path for regular
> > > > > files on ext4.
> > > > >
> > > > > - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
> > > > > inode uses iomap instead of buffer_head for buffered I/O
> > > > > - Add helper ext4_inode_buffered_iomap() to check the flag
> > > > > - Add new address space operations ext4_iomap_aops with callbacks that
> > > > > will use generic iomap implementations
> > > > > - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
> > > > >
> > > > > The following callbacks(read_folio(), readahead(), writepages()) are
> > > > > provided as placeholders and will be implemented in later patches.
> > > > >
> > > > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > > > Reviewed-by: Jan Kara <jack@xxxxxxx>
> > > >
> > > > Hi Zhang, looks good to me. Just a questions below:
> > >
> > > Hi, Ojaswin! Thank you for the review of this series.
> > >
> > > > > ---
> > > > > fs/ext4/ext4.h | 7 +++++++
> > > > > fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
> > > > > 2 files changed, 39 insertions(+)
> > > > >
> > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > index 94283a991e5c..1e27d73d7427 100644
> > > > > --- a/fs/ext4/ext4.h
> > > > > +++ b/fs/ext4/ext4.h
> > > > > @@ -1972,6 +1972,7 @@ enum {
> > > > > EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
> > > > > EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */
> > > > > EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
> > > > > + EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
> > > > > };
> > > > > #define EXT4_INODE_BIT_FNS(name, field, offset) \
> > > > > @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
> > > > > !list_empty(&EXT4_I(inode)->i_orphan);
> > > > > }
> > > > > +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> > > > > +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> > > > > +{
> > > > > + return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Codes for operating systems
> > > > > */
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index b1ef706987c3..178ac2be37b7 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
> > > > > .iomap_begin = ext4_iomap_begin_report,
> > > > > };
> > > > > +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> > > > > +{
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void ext4_iomap_readahead(struct readahead_control *rac)
> > > > > +{
> > > > > +
> > > > > +}
> > > > > +
> > > > > +static int ext4_iomap_writepages(struct address_space *mapping,
> > > > > + struct writeback_control *wbc)
> > > > > +{
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * For data=journal mode, folio should be marked dirty only when it was
> > > > > * writeably mapped. When that happens, it was already attached to the
> > > > > @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
> > > > > .swap_activate = ext4_iomap_swap_activate,
> > > > > };
> > > > > +static const struct address_space_operations ext4_iomap_aops = {
> > > > > + .read_folio = ext4_iomap_read_folio,
> > > > > + .readahead = ext4_iomap_readahead,
> > > > > + .writepages = ext4_iomap_writepages,
> > > > > + .dirty_folio = iomap_dirty_folio,
> > > > > + .bmap = ext4_bmap,
> > > > > + .invalidate_folio = iomap_invalidate_folio,
> > > > > + .release_folio = iomap_release_folio,
> > > > > + .migrate_folio = filemap_migrate_folio,
> > > > > + .is_partially_uptodate = iomap_is_partially_uptodate,
> > > > > + .error_remove_folio = generic_error_remove_folio,
> > > > > + .swap_activate = ext4_iomap_swap_activate,
> > > > > +};
> > > >
> > > > So one question, for ->release_folio() we are using
> > > > iomap_release_folio() instead of ext4_release_folio() here which doesnt
> > > > make the jbd2_journal_try_to_free_bufferes() call. IIUC this function
> > > > seems to be trying to clean up already checkpointed buffers.
> > > >
> > > > I wanted to check if ->release_folio() can be called for folios with
> > > > ext4 metadata buffers? (from my limited understanding of
> > > > shrink_folio_list() -> filemap_release_folio() it seems we can) And if
> > > > it can be called, is it okay to skip the
> > > > jbd2_journal_try_to_free_buffers call?
> > >
> > > Here, in ->release_folio(), folio->mapping points to inode->i_data (the
> > > file's pagecache), not the block device's pagecache. ext4 metadata
> > > resides in the block device's pagecache, which is at a different layer
> > > than this release_folio callback. So we don't need to call
> > > jbd2_journal_try_to_free_buffers() in the iomap path here.
> >
> > Hi Yi,
> >
> > Thanks for clarify and yes, thats what I was missing! So this
> > ->release_folio() is only for data folios. So I guess the
> > jbd2_journal_try_to_free_buffers() is mostly to handle data=journal
> > case?
>
> Yes, that's my understanding as well. Meanwhile, the comment for the
> jbd2_journal_try_to_free_buffers() function looks quite outdated and
> needs to be updated.
Looks good, thanks for explanation and fixing it.
Regards,
ojaswin
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..239bcf88ed1c 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct
> journal_head *jh)
> }
>
> /**
> - * jbd2_journal_try_to_free_buffers() - try to free page buffers.
> + * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
> * @journal: journal for operation
> * @folio: Folio to detach data from.
> *
> - * For all the buffers on this page,
> - * if they are fully written out ordered data, move them onto BUF_CLEAN
> - * so try_to_free_buffers() can reap them.
> + * For each buffer_head on @folio, if the buffer has a journal_head but
> + * is not attached to a running or committing transaction, try to remove
> + * it from the checkpoint list. This is needed for data=journal mode
> + * where data buffers are journaled: once they are checkpointed, the
> + * journal_head can be detached and the buffer freed. If any buffer is
> + * still attached to a transaction, the folio cannot be released and we
> + * bail out. Otherwise we call try_to_free_buffers() to detach all
> + * buffer_heads from the folio.
> *
> - * This function returns non-zero if we wish try_to_free_buffers()
> - * to be called. We do this if the page is releasable by
> try_to_free_buffers().
> - * We also do it if the page has locked or dirty buffers and the caller
> wants
> - * us to perform sync or async writeout.
> + * For data=ordered and writeback modes, data buffers never have
> + * journal_heads, so this degenerates to a plain try_to_free_buffers().
> *
> - * This complicates JBD locking somewhat. We aren't protected by the
> - * BKL here. We wish to remove the buffer from its committing or
> - * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
> - *
> - * This may *change* the value of transaction_t->t_datalist, so anyone
> - * who looks at t_datalist needs to lock against this function.
> - *
> - * Even worse, someone may be doing a jbd2_journal_dirty_data on this
> - * buffer. So we need to lock against that. jbd2_journal_dirty_data()
> - * will come out of the lock with the buffer dirty, which makes it
> - * ineligible for release here.
> - *
> - * Who else is affected by this? hmm... Really the only contender
> - * is do_get_write_access() - it could be looking at the buffer while
> - * journal_try_to_free_buffer() is changing its state. But that
> - * cannot happen because we never reallocate freed data as metadata
> - * while the data is part of a transaction. Yes?
> - *
> - * Return false on failure, true on success
> + * Return: true if the folio's buffers were freed, false otherwise
> */
> bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio
> *folio)
> {
>
> Thanks,
> Yi.
>
>
>