Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios

From: Jan Kara

Date: Mon Jun 08 2026 - 10:53:09 EST


On Mon 08-06-26 20:01:31, Jia Zhu wrote:
> Ext4 buffered writes into large folios also pay a full buffer_head
> walk in ext4_block_write_begin(). For a small overwrite of an existing
> cached folio, the folio is already uptodate and the write only needs to
> prepare the buffers through the written range. Walking the suffix still
> makes the write_begin cost proportional to the folio size.
>
> Before ext4 enabled large folios for regular files, the same loop was
> bounded by a single page of buffers. That commit made the existing
> full-folio walk visible as a regression for cached small overwrites.
>
> The suffix walk is needed for non-uptodate folios, where ext4 may have
> to submit reads for partial blocks, preserve new-buffer cleanup, and run
> error zeroing. Keep those folios on the old full walk.
>
> For already-uptodate folios, keep the walk starting at the first buffer
> rather than seeking directly to from. This preserves the existing prefix
> buffer state handling. Stop once block_start reaches the end of the
> write range, because the skipped suffix would only repeat the
> outside-range uptodate handling for buffers beyond @to.
>
> On current master, the libMicro ext4 large-folio overwrite test shows
> the following full-series result. Results are median usecs/call over 10
> runs, lower is better:
>
> case nofix this series improvement
> write_u1k 1.418 0.3405 76.0%
> write_u10k 1.887 0.4175 77.9%
> pwrite_u1k 1.6775 0.3390 79.8%
> pwrite_u10k 1.9035 0.4130 78.3%
>
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Signed-off-by: Jia Zhu <zhujia.zj@xxxxxxxxxxxxx>

Looks good, just one simplification suggestion:

> @@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> head = create_empty_buffers(folio, blocksize, 0);
> block = EXT4_PG_TO_LBLK(inode, folio->index);
>
> - for (bh = head, block_start = 0; bh != head || !block_start;
> + for (bh = head, block_start = 0;
> + (bh != head || !block_start) &&
> + (!folio_uptodate || block_start < to);

You simplify this condition to:

block_start < to || (!folio_uptodate && bh != head)

With this updated feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

Honza


> block++, block_start = block_end, bh = bh->b_this_page) {
> block_end = block_start + blocksize;
> if (block_end <= from || block_start >= to) {
> - if (folio_test_uptodate(folio)) {
> + if (folio_uptodate)
> set_buffer_uptodate(bh);
> - }
> continue;
> }
> if (WARN_ON_ONCE(buffer_new(bh)))
> @@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> if (should_journal_data)
> do_journal_get_write_access(handle,
> inode, bh);
> - if (folio_test_uptodate(folio)) {
> + if (folio_uptodate) {
> /*
> * Unlike __block_write_begin() we leave
> * dirtying of new uptodate buffers to
> @@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> continue;
> }
> }
> - if (folio_test_uptodate(folio)) {
> + if (folio_uptodate) {
> set_buffer_uptodate(bh);
> continue;
> }
> --
> 2.20.1
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR