Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O

From: Zhang Yi

Date: Tue May 19 2026 - 22:50:44 EST

On 5/20/2026 12:53 AM, Ojaswin Mujoo wrote:

On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:

On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:

On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:

From: Zhang Yi <yi.zhang@xxxxxxxxxx>

Introduce initial support for iomap in the buffered I/O path for regular
files on ext4.

- Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
inode uses iomap instead of buffer_head for buffered I/O
- Add helper ext4_inode_buffered_iomap() to check the flag
- Add new address space operations ext4_iomap_aops with callbacks that
will use generic iomap implementations
- Add ext4_iomap_aops to ext4_set_aops() when the flag is set

The following callbacks(read_folio(), readahead(), writepages()) are
provided as placeholders and will be implemented in later patches.

Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
Reviewed-by: Jan Kara <jack@xxxxxxx>

Hi Zhang, looks good to me. Just a questions below:

Hi, Ojaswin! Thank you for the review of this series.

---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..1e27d73d7427 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1972,6 +1972,7 @@ enum {
EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */
EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
+ EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
!list_empty(&EXT4_I(inode)->i_orphan);
}
+/* Whether the inode pass through the iomap infrastructure for buffered I/O */
+static inline bool ext4_inode_buffered_iomap(struct inode *inode)
+{
+ return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+}
+
/*
* Codes for operating systems
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b1ef706987c3..178ac2be37b7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ return 0;
+}
+
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+
+}
+
+static int ext4_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return 0;
+}
+
/*
* For data=journal mode, folio should be marked dirty only when it was
* writeably mapped. When that happens, it was already attached to the
@@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
.swap_activate = ext4_iomap_swap_activate,
};
+static const struct address_space_operations ext4_iomap_aops = {
+ .read_folio = ext4_iomap_read_folio,
+ .readahead = ext4_iomap_readahead,
+ .writepages = ext4_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .bmap = ext4_bmap,
+ .invalidate_folio = iomap_invalidate_folio,
+ .release_folio = iomap_release_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+ .swap_activate = ext4_iomap_swap_activate,
+};

So one question, for ->release_folio() we are using
iomap_release_folio() instead of ext4_release_folio() here which doesnt
make the jbd2_journal_try_to_free_bufferes() call. IIUC this function
seems to be trying to clean up already checkpointed buffers.

I wanted to check if ->release_folio() can be called for folios with
ext4 metadata buffers? (from my limited understanding of
shrink_folio_list() -> filemap_release_folio() it seems we can) And if
it can be called, is it okay to skip the
jbd2_journal_try_to_free_buffers call?

Here, in ->release_folio(), folio->mapping points to inode->i_data (the
file's pagecache), not the block device's pagecache. ext4 metadata
resides in the block device's pagecache, which is at a different layer
than this release_folio callback. So we don't need to call
jbd2_journal_try_to_free_buffers() in the iomap path here.

Hi Yi,

Thanks for clarify and yes, thats what I was missing! So this
->release_folio() is only for data folios. So I guess the
jbd2_journal_try_to_free_buffers() is mostly to handle data=journal
case?

Yes, that's my understanding as well. Meanwhile, the comment for the
jbd2_journal_try_to_free_buffers() function looks quite outdated and
needs to be updated.

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 4885903bbd10..239bcf88ed1c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
}

/**
- * jbd2_journal_try_to_free_buffers() - try to free page buffers.
+ * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
* @journal: journal for operation
* @folio: Folio to detach data from.
*
- * For all the buffers on this page,
- * if they are fully written out ordered data, move them onto BUF_CLEAN
- * so try_to_free_buffers() can reap them.
+ * For each buffer_head on @folio, if the buffer has a journal_head but
+ * is not attached to a running or committing transaction, try to remove
+ * it from the checkpoint list. This is needed for data=journal mode
+ * where data buffers are journaled: once they are checkpointed, the
+ * journal_head can be detached and the buffer freed. If any buffer is
+ * still attached to a transaction, the folio cannot be released and we
+ * bail out. Otherwise we call try_to_free_buffers() to detach all
+ * buffer_heads from the folio.
*
- * This function returns non-zero if we wish try_to_free_buffers()
- * to be called. We do this if the page is releasable by try_to_free_buffers().
- * We also do it if the page has locked or dirty buffers and the caller wants
- * us to perform sync or async writeout.
+ * For data=ordered and writeback modes, data buffers never have
+ * journal_heads, so this degenerates to a plain try_to_free_buffers().
*
- * This complicates JBD locking somewhat. We aren't protected by the
- * BKL here. We wish to remove the buffer from its committing or
- * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
- *
- * This may *change* the value of transaction_t->t_datalist, so anyone
- * who looks at t_datalist needs to lock against this function.
- *
- * Even worse, someone may be doing a jbd2_journal_dirty_data on this
- * buffer. So we need to lock against that. jbd2_journal_dirty_data()
- * will come out of the lock with the buffer dirty, which makes it
- * ineligible for release here.
- *
- * Who else is affected by this? hmm... Really the only contender
- * is do_get_write_access() - it could be looking at the buffer while
- * journal_try_to_free_buffer() is changing its state. But that
- * cannot happen because we never reallocate freed data as metadata
- * while the data is part of a transaction. Yes?
- *
- * Return false on failure, true on success
+ * Return: true if the folio's buffers were freed, false otherwise
*/
bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
{

Thanks,
Yi.