[PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path
From: Zhang Yi
Date: Mon May 11 2026 - 03:34:53 EST
From: Zhang Yi <yi.zhang@xxxxxxxxxx>
Hi,
This version is a small revision of v3 with no design changes. It fixes
some issues pointed out by Jan and Sashiko, and adds numerous comments
to clarify functionality and key considerations. You can get commits
here:
https://github.com/zhangyi089/linux/commits/ext4_buffered_iomap_v4/
Original Cover-letter:
===
This series adds the iomap buffered I/O path support for regular files,
based on the latest upstream kernel. It implements the core iomap APIs
on ext4 and introduces the 'buffered_iomap' mount option to enable the
iomap buffered I/O path. It supports the default features, the default
mount options, and the bigalloc feature. However, it does not support
online defragmentation, inline data, fsverity, fscrypt, non-extent
inodes, or data=journal mode; it falls back to the buffer_head I/O path
automatically if any of these features or options is in use.
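The fallback rule above can be pictured with a small userspace sketch in
C. This is only a toy model, not code from the series: the SK_* flag
names and sk_use_buffered_iomap() are hypothetical and exist only for
this illustration. Online defragmentation is left out because it is an
operation rather than an inode property (the series disables it for
inodes on the iomap path).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode properties for this sketch; not kernel names. */
enum sk_inode_flag {
	SK_INLINE_DATA  = 1 << 0,	/* data stored inline in the inode */
	SK_VERITY       = 1 << 1,	/* fsverity enabled */
	SK_ENCRYPT      = 1 << 2,	/* fscrypt enabled */
	SK_NO_EXTENTS   = 1 << 3,	/* non-extent (indirect block) inode */
	SK_DATA_JOURNAL = 1 << 4,	/* data=journal mode */
};

#define SK_UNSUPPORTED \
	(SK_INLINE_DATA | SK_VERITY | SK_ENCRYPT | SK_NO_EXTENTS | SK_DATA_JOURNAL)

/*
 * Return true if the inode could take the iomap buffered I/O path,
 * false if it has to fall back to the buffer_head path.
 */
bool sk_use_buffered_iomap(bool mount_opt_enabled, unsigned int flags)
{
	/* Only flags known to this sketch may be passed in. */
	assert((flags & ~(unsigned int)SK_UNSUPPORTED) == 0);

	if (!mount_opt_enabled)		/* 'nobuffered_iomap' or the default */
		return false;

	return (flags & SK_UNSUPPORTED) == 0;
}
```

For example, sk_use_buffered_iomap(true, 0) is true, while any
unsupported flag, or a mount without 'buffered_iomap', yields false.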
This iomap buffered I/O path is not enabled by default because the
preceding features are not supported. Users can explicitly enable or
disable it via 'buffered_iomap' and 'nobuffered_iomap' mount options.
Key notes
=========
1. Lock ordering difference
The lock ordering of folio lock and transaction start in the iomap
path is the opposite of that in the buffer_head path.
2. data=ordered mode is not used
Two main reasons:
a) The lock ordering of folio lock and transaction start for
data=ordered mode is opposite to the iomap path, which would cause
a deadlock.
b) The iomap writeback path does not support partial folio submission
(required by data=ordered mode when block size < folio size, and
it is currently handled by ext4_bio_write_folio()), which would
also cause a deadlock.
To replace data=ordered mode functionality:
- For append write: Always allocate unwritten extents (dioread_nolock
behavior) to prevent stale data exposure.
- For post-EOF partial block zeroing: Issue the zeroing I/O immediately
and wait for its completion, either asynchronously or synchronously,
before updating i_disksize. On ordered I/O completion, set i_disksize
to i_size to avoid lost updates in the truncate up and append
fallocate cases (suggested by Jan).
- For online defragmentation: Not supported yet, needs further
consideration.
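As a rough illustration of the post-EOF zeroing item above, here is a
toy userspace model in C. It is not the kernel code: the struct, the
sk_ prefixed names, and the single in-flight flag are assumptions made
for this sketch, and the series uses its own tracking (see patches
17-21). It only tries to show why completion sets i_disksize to the
current i_size rather than to the size seen at submission time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: only the fields needed for this illustration. */
struct sk_inode {
	long long i_size;	/* in-memory file size */
	long long i_disksize;	/* file size recorded on disk */
	bool zero_io_inflight;	/* zeroing I/O submitted, not yet completed */
};

/* Submit the zeroing I/O for the post-EOF partial block. */
void sk_submit_zero_io(struct sk_inode *inode)
{
	inode->zero_io_inflight = true;
}

/*
 * Extend the file (truncate up or append fallocate). i_size grows at
 * once, but i_disksize must not move past the stale block until the
 * zeroing I/O has completed.
 */
void sk_extend(struct sk_inode *inode, long long new_size)
{
	assert(new_size >= inode->i_size);

	inode->i_size = new_size;
	if (!inode->zero_io_inflight)
		inode->i_disksize = new_size;
}

/*
 * Completion of the ordered zeroing I/O: catch up to the current
 * i_size, so an extension that happened meanwhile is not lost.
 */
void sk_zero_io_done(struct sk_inode *inode)
{
	inode->zero_io_inflight = false;
	inode->i_disksize = inode->i_size;
}
```

In this model, an extension that arrives while the zeroing I/O is in
flight leaves i_disksize untouched, and sk_zero_io_done() then brings
it up to the new i_size in one step.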
3. Always enable dioread_nolock
Two main reasons:
a) Since data=ordered mode cannot be used, allocating written blocks
directly would expose stale data.
b) To optimize writeback, we should allocate blocks based on writeback
length rather than per-folio mapping. Direct written allocation
would over-allocate blocks.
dioread_nolock has been the default mount option for many years, and
Jan pointed out that we may no longer need to disable it, so this mount
option can be gradually removed in the future.
Series structure
================
- Patch 01-03: Simplify truncate operations and prepare for conversion.
- Patch 04-16: Implement core iomap buffered read/write, writeback,
mmap, and partial block zeroing paths.
- Patch 17-21: Handle ordered I/O for zeroing post-EOF partial block.
- Patch 22-23: Enable iomap buffered I/O path.
Testing and Performance
=======================
Tested with xfstests-bld using -g auto, fast_commit, and 64k
configurations. No new regressions were observed.
For the special case of zeroing a post-EOF partial block, I added a new
test, generic/790, to cover this scenario.
https://lore.kernel.org/fstests/20260428085750.1072612-1-yi.zhang@xxxxxxxxxxxxxxx/
Performance was tested with fio on a 150 GB memory-backed virtual
machine (the results differ little from v2 and v3, so the numbers below
are not updated):
Buffered write (MiB/s)
===
bs    write cache       uncached write
      bh      iomap     bh      iomap
1k    423     403       36.3    57
4k    1067    1093      58.4    61
64k   4321    6488      869     1206
1M    4640    7378      3158    4818
Buffered read (MiB/s)
===
bs    read hole         read pre-cache    read ondisk data
      bh      iomap     bh      iomap     bh      iomap
1k    635     643       661     653       605     602
4k    1987    2075      2128    2159      1761    1716
64k   6068    6267      9472    9545      4475    4451
1M    5471    6072      8657    9191      4405    4467
Large I/O write performance (64k and 1M block sizes) improved by
approximately 40% to 60%.
Read performance showed no significant difference.
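For anyone who wants to recompute the relative gains for the large
block sizes, a minimal C sketch follows. The throughput values are
copied from the buffered write table above; the struct and the sk_
function names are hypothetical helpers for this illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* One table entry: buffer_head vs. iomap throughput in MiB/s. */
struct sk_row {
	const char *label;
	double bh;
	double iomap;
};

/* 64k and 1M rows of the buffered write table. */
const struct sk_row sk_large_writes[] = {
	{ "64k write cache",    4321.0, 6488.0 },
	{ "1M write cache",     4640.0, 7378.0 },
	{ "64k uncached write",  869.0, 1206.0 },
	{ "1M uncached write",  3158.0, 4818.0 },
};

/* Relative change of iomap over buffer_head, in percent. */
double sk_gain_percent(double bh, double iomap)
{
	assert(bh > 0.0);
	return (iomap - bh) / bh * 100.0;
}

/* Print one line per table entry with its relative change. */
void sk_print_gains(void)
{
	size_t i;
	size_t n = sizeof(sk_large_writes) / sizeof(sk_large_writes[0]);

	for (i = 0; i < n; i++)
		printf("%-20s %+.1f%%\n", sk_large_writes[i].label,
		       sk_gain_percent(sk_large_writes[i].bh,
				       sk_large_writes[i].iomap));
}
```

Calling sk_print_gains() from a small main() prints the four rows; the
same helper can be pointed at the 1k and 4k rows to see that small
writes change much less.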
Changes since v3:
- Rebased on the latest upstream kernel.
- Improve commit messages for patches 07-23 to clarify functionality
and key considerations.
- Move the patches that enable iomap to the end of this series.
- Patch 02: Move ext4_set_inode_size() declarations from ext4.h into
inode.c, move truncate_pagecache() and ext4_truncate() to
ext4_truncate_down() as Jan suggested.
- Patch 08: Add check for non-extent inodes in the non-delalloc write
path, and clarify the reason why we don't need to truncate blocks on
short writes. (Pointed out by sashiko)
- Patch 09: Fix the issue where DATA_ERR_ABORT fails to work in
overwrite scenarios. Replace iomap_finish_ioends() with
iomap_finish_ioend() during end_io to prevent might_sleep() being
called in interrupt context. (Pointed out by sashiko)
- Patch 11: Fix underflow of the nr_blks variable. (Pointed out by
sashiko)
- Patch 17: Factor out ext4_iomap_submit_zero_block() helper to handle
ordered mode after zeroing a post-EOF partial block in the iomap
path, also add comments.
- Patch 18: Fix off-by-one in ext4_iomap_wb_ordered_wait() and clarify
why a single i_ordered_len tracker suffices. (Pointed out by sashiko)
- Patch 19: Fix an issue where the correct file size may be lost due to
a missing memory barrier. (Pointed out by sashiko)
- Patch 20: Change the logic for waiting on ordered I/Os in the insert
range and collapse range from asynchronous to synchronous.
- Patch 21: Allow per-inode journal mode changes but disallow per-inode
extent type changes, and add comments on the restrictions on using
iomap.
Changes since v2:
- Rebased on the latest upstream kernel (7.1-rc1).
- Added patches 01-03 to simplify truncate operations.
- Added patch 13 to fix incorrect did_zero parameter in
iomap_zero_range().
- Added patches 19-22 to handle ordered I/O for zeroing post-EOF
partial block.
- Minor code and comment optimizations.
Changes since v1:
- Rebase this series on linux-next 20260122.
- Refactor partial block zero range, stop passing handle to
ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
partial block zeroing operation outside an active journal transaction
to prevent potential deadlocks because of the lock ordering of folio
and transaction start.
- Clarify the lock ordering of folio lock and transaction start, update
the comments accordingly.
- Fix some issues related to fast commit and to polluting the post-EOF
folio.
- Some minor code and comment optimizations.
v3: https://lore.kernel.org/linux-ext4/20260422021042.4157510-1-yi.zhang@xxxxxxxxxxxxxxx/
v2: https://lore.kernel.org/linux-ext4/20260203062523.3869120-1-yi.zhang@xxxxxxxxxx/
v1: https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@xxxxxxxxxxxxxxx/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@xxxxxxxxxxxxxxx/
Comments and suggestions are welcome!
Thanks,
Yi.
Zhang Yi (23):
ext4: simplify size updating in ext4_setattr()
ext4: factor out ext4_truncate_[up|down]()
ext4: simplify error handling in ext4_setattr()
ext4: add iomap address space operations for buffered I/O
ext4: implement buffered read path using iomap
ext4: pass out extent seq counter when mapping da blocks
ext4: do not use data=ordered mode for inodes using buffered iomap
path
ext4: implement buffered write path using iomap
ext4: implement writeback path using iomap
ext4: implement mmap path using iomap
iomap: correct the range of a partial dirty clear
iomap: support invalidating partial folios
iomap: fix incorrect did_zero setting in iomap_zero_iter()
ext4: implement partial block zero range path using iomap
ext4: add block mapping tracepoints for iomap buffered I/O path
ext4: disable online defrag when inode using iomap buffered I/O path
ext4: submit zeroed post-EOF data immediately in the iomap buffered
I/O path
ext4: wait for ordered I/O in the iomap buffered I/O path
ext4: update i_disksize to i_size on ordered I/O completion
ext4: wait for ordered I/O to complete during insert and collapse
range
ext4: add tracepoints for ordered I/O in the iomap buffered I/O path
ext4: partially enable iomap for the buffered I/O path of regular
files
ext4: introduce a mount option for iomap buffered I/O path
fs/ext4/ext4.h | 57 +-
fs/ext4/ext4_jbd2.c | 8 +-
fs/ext4/ext4_jbd2.h | 7 +-
fs/ext4/extents.c | 18 +
fs/ext4/file.c | 20 +-
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 1040 ++++++++++++++++++++++++++++++-----
fs/ext4/migrate.c | 2 +
fs/ext4/move_extent.c | 11 +
fs/ext4/page-io.c | 210 +++++++
fs/ext4/super.c | 55 +-
fs/iomap/buffered-io.c | 22 +-
fs/iomap/ioend.c | 3 +-
include/linux/iomap.h | 1 +
include/trace/events/ext4.h | 142 +++++
15 files changed, 1446 insertions(+), 151 deletions(-)
--
2.52.0