[PATCH v3 00/22] ext4: use iomap for regular file's buffered I/O path

From: Zhang Yi

Date: Tue Apr 21 2026 - 22:22:47 EST


From: Zhang Yi <yi.zhang@xxxxxxxxxx>

This series adds iomap buffered I/O path support for regular files,
based on the latest upstream kernel. It implements the core iomap APIs
on ext4 and introduces the 'buffered_iomap' mount option to enable the
iomap buffered I/O path. It supports the default features, the default
mount options and the bigalloc feature. However, it does not yet support
online defragmentation, inline data, fsverity, fscrypt, non-extent
inodes, or data=journal mode; ext4 falls back to the buffer_head I/O
path automatically when any of these features or options is in use.
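
As a rough illustration of the fallback rule described above, here is a
minimal user-space sketch. It is not the code from this series: the
struct, its field names and model_use_buffered_iomap() are hypothetical
names made up for the example, and online defragmentation is left out
because the sketch treats it as an operation rather than a per-inode
property (an assumption).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of the per-inode state that matters
 * for path selection. None of these names are kernel identifiers.
 */
struct inode_model {
	bool is_regular_file;
	bool has_inline_data;
	bool uses_fsverity;
	bool uses_fscrypt;
	bool extent_mapped;        /* false for non-extent inodes */
	bool data_journal;         /* data=journal mode in effect */
	bool mount_buffered_iomap; /* 'buffered_iomap' mount option set */
};

/*
 * Return true if the modeled inode would take the iomap buffered I/O
 * path, false if it would fall back to the buffer_head path.
 */
static bool model_use_buffered_iomap(const struct inode_model *inode)
{
	assert(inode != NULL);

	if (!inode->mount_buffered_iomap || !inode->is_regular_file)
		return false;
	if (inode->has_inline_data || inode->uses_fsverity ||
	    inode->uses_fscrypt)
		return false;
	if (!inode->extent_mapped || inode->data_journal)
		return false;
	return true;
}
```

In this model the decision is all-or-nothing per inode: a single
unsupported feature is enough to select the buffer_head path.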

This iomap buffered I/O path is not enabled by default because the
features listed above are not yet supported. Users can explicitly enable
or disable it via the 'buffered_iomap' and 'nobuffered_iomap' mount
options.

Key notes
=========

1. Lock ordering difference

The lock ordering of folio lock and transaction start in the iomap
path is the opposite of that in the buffer_head path.

2. data=ordered mode is not used

Two main reasons:
 a) The lock ordering of folio lock and transaction start for
    data=ordered mode is opposite to the iomap path, which would cause
    a deadlock.
 b) The iomap writeback path does not support partial folio submission
    (required by data=ordered mode when block size < folio size, and
    it is currently handled by ext4_bio_write_folio()), which would
    also cause a deadlock.
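
To make the lock ordering problem concrete, the following is a small
user-space sketch of an order tracker, loosely in the spirit of what
lockdep checks. It is only a model: the enum values, struct order_graph
and record_order() are hypothetical names, a transaction start is
treated as if it were a plain lock class, and the sketch does not claim
which path uses which order.

```c
#include <assert.h>
#include <stdbool.h>

/* Two modeled "lock classes"; names are illustrative only. */
enum lock_class { LC_FOLIO_LOCK = 0, LC_TXN_START = 1, LC_COUNT };

/* seen_before[a][b] is true once some path took a and then b. */
struct order_graph {
	bool seen_before[LC_COUNT][LC_COUNT];
};

/*
 * Record that a path acquires 'outer' and then 'inner' while still
 * holding 'outer'. Returns false if the opposite order was already
 * recorded, i.e. two paths could deadlock against each other (ABBA).
 */
static bool record_order(struct order_graph *g, enum lock_class outer,
			 enum lock_class inner)
{
	assert(outer != inner);

	if (g->seen_before[inner][outer])
		return false;
	g->seen_before[outer][inner] = true;
	return true;
}
```

Recording the same order twice is fine; recording the reverse order
after that is reported as an inversion, which is the situation the two
reasons above describe for mixing data=ordered mode with the iomap path.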

To replace data=ordered mode functionality:

 - For append write: Always allocate unwritten extents (dioread_nolock
   behavior) to prevent stale data exposure.

 - For post-EOF partial block zeroing: Issue zeroing I/O immediately
   and wait for completion before updating i_disksize. On ordered I/O
   completion, set i_disksize = i_size to avoid lost updates in the
   truncate up case (suggested by Jan).

 - For online defragmentation: Not supported yet, needs further
   consideration.

3. Always enable dioread_nolock

Two main reasons:
 a) Since data=ordered mode cannot be used, allocating written blocks
    directly would expose stale data.
 b) To optimize writeback, we should allocate blocks based on writeback
    length rather than per-folio mapping. Direct written allocation
    would over-allocate blocks.

dioread_nolock has been the default mount option for many years, and
Jan pointed out that we may no longer need to disable it, so this mount
option can be gradually removed in the future.

Series structure
================

 - Patches 01-03: Simplify truncate operations and prepare for conversion.
 - Patches 04-18: Implement core iomap buffered read/write, writeback,
                  mmap, and partial block zeroing paths.
 - Patches 19-22: Handle ordered I/O for zeroing post-EOF partial block.

Testing and Performance
=======================

Tested with xfstests-bld using -g auto, fast_commit, and 64k
configurations. No new regressions were observed.

For the special case of zeroing the post-EOF partial block, I added a
new test, generic/790, to cover this scenario:

https://lore.kernel.org/fstests/20260422015246.4132376-1-yi.zhang@xxxxxxxxxxxxxxx/

Performance was tested with fio on a 150 GB memory-backed virtual
machine (not much difference compared to v2, so the numbers are not
updated):

Buffered write (MiB/s)
======================

 bs    write cache      uncached write
       bh      iomap    bh      iomap
 1k    423     403      36.3    57
 4k    1067    1093     58.4    61
 64k   4321    6488     869     1206
 1M    4640    7378     3158    4818

Buffered read (MiB/s)
=====================

 bs    read hole        read pre-cache   read ondisk data
       bh      iomap    bh      iomap    bh      iomap
 1k    635     643      661     653      605     602
 4k    1987    2075     2128    2159     1761    1716
 64k   6068    6267     9472    9545     4475    4451
 1M    5471    6072     8657    9191     4405    4467

Large I/O (64k and 1M) write throughput improved by roughly 40% to 60%
according to the numbers above.
Read performance showed no significant difference.
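
For anyone who wants to recompute the relative change from the tables
above, a minimal helper is sketched below. The function name
improvement_pct() is made up for this example; the inputs are simply a
bh and an iomap throughput value from the same row and column group.

```c
#include <assert.h>

/*
 * Relative throughput change of the iomap path over the buffer_head
 * path, in percent. Positive means iomap is faster.
 */
static double improvement_pct(double bh_mibs, double iomap_mibs)
{
	assert(bh_mibs > 0.0);

	return (iomap_mibs - bh_mibs) / bh_mibs * 100.0;
}
```

For example, improvement_pct(4321.0, 6488.0) gives the change for the
64k "write cache" row, and improvement_pct(3158.0, 4818.0) the change
for the 1M "uncached write" row.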

Changes since v2:
- Rebased on the latest upstream kernel (7.1-rc1).
- Added patches 01-03 to simplify truncate operations.
- Added patch 13 to fix incorrect did_zero parameter in
iomap_zero_range().
- Added patches 19-22 to handle ordered I/O for zeroing post-EOF
partial block.
- Minor code and comment optimizations.

Changes since v1:
- Rebase this series on linux-next 20260122.
- Refactor partial block zero range, stop passing handle to
ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
partial block zeroing operation outside an active journal transaction
to prevent potential deadlocks because of the lock ordering of folio
and transaction start.
- Clarify the lock ordering of folio lock and transaction start, update
the comments accordingly.
- Fix some issues related to fast commit and to polluting the post-EOF
  folio.
- Some minor code and comment optimizations.

v2: https://lore.kernel.org/linux-ext4/20260203062523.3869120-1-yi.zhang@xxxxxxxxxx/
v1: https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@xxxxxxxxxxxxxxx/

Comments and suggestions are welcome!

Thanks,
Yi.


Zhang Yi (22):
  ext4: simplify size updating in ext4_setattr()
  ext4: factor out ext4_truncate_[up|down]()
  ext4: simplify error handling in ext4_setattr()
  ext4: add iomap address space operations for buffered I/O
  ext4: implement buffered read path using iomap
  ext4: pass out extent seq counter when mapping da blocks
  ext4: do not use data=ordered mode for inodes using buffered iomap
    path
  ext4: implement buffered write path using iomap
  ext4: implement writeback path using iomap
  ext4: implement mmap path using iomap
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: fix incorrect did_zero setting in iomap_zero_iter()
  ext4: implement partial block zero range path using iomap
  ext4: add block mapping tracepoints for iomap buffered I/O path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: introduce a mount option for iomap buffered I/O path
  ext4: submit zeroed post-EOF data immediately in the iomap buffered
    I/O path
  ext4: wait for ordered I/O in the iomap buffered I/O path
  ext4: update i_disksize to i_size on ordered I/O completion
  ext4: add tracepoints for ordered I/O in the iomap buffered I/O path

fs/ext4/ext4.h | 73 ++-
fs/ext4/ext4_jbd2.c | 1 +
fs/ext4/ext4_jbd2.h | 7 +-
fs/ext4/extents.c | 9 +-
fs/ext4/file.c | 20 +-
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 911 +++++++++++++++++++++++++++++++-----
fs/ext4/move_extent.c | 11 +
fs/ext4/page-io.c | 203 ++++++++
fs/ext4/super.c | 55 ++-
fs/iomap/buffered-io.c | 20 +-
include/trace/events/ext4.h | 142 ++++++
12 files changed, 1313 insertions(+), 140 deletions(-)

--
2.52.0