Re: [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

From: Jens Axboe

Date: Fri May 01 2026 - 12:44:33 EST


On 5/1/26 3:49 AM, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context. Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself: walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23 ms (buffered) to
> 93 ms (dontcache).
>
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background. This moves writeback submission
> completely off the writer's hot path.
>
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> write back. The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
>
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
>
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
>
> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
>
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
>
> Single-stream throughput (MB/s):
>                              Before    After   Change
> seq-write/dontcache             298      897    +201%
> rand-write/dontcache            131      236     +80%
>
> Tail latency improvements (seq-write/dontcache):
> p99: 135,266 us -> 23,986 us (-82%)
> p99.9: 8,925,479 us -> 28,443 us (-99.7%)
>
> Multi-writer (4 jobs, sequential write):
>                              Before    After   Change
> dontcache aggregate (MB/s)    2,529    4,532     +79%
> dontcache p99 (us)            8,553    1,002     -88%
> dontcache p99.9 (us)        109,314    1,057     -99%
>
> Dontcache multi-writer throughput now matches buffered (4,532 vs
> 4,616 MB/s).
>
> 32-file write (Axboe test):
>                              Before    After   Change
> dontcache aggregate (MB/s)    1,548    3,499    +126%
> dontcache p99 (us)           10,170      602     -94%
> Peak dirty pages (MB)         1,837      213     -88%
>
> Dontcache now reaches 81% of buffered throughput (was 35%).
>
> Competing writers (dontcache vs buffered, separate files):
>                              Before    After
> buffered writer (MB/s)          868      433
> dontcache writer (MB/s)         415      433
> Aggregate (MB/s)              1,284      866
>
> Previously the buffered writer starved the dontcache writer 2:1.
> With per-bdi_writeback tracking, both writers now receive equal
> bandwidth. The aggregate matches the buffered-vs-buffered baseline
> (863 MB/s), indicating fair sharing regardless of I/O mode.
>
> The dontcache writer's p99.9 latency collapsed from 119 ms to
> 33 ms (-73%), eliminating the severe periodic stalls seen in the
> baseline. Both writers now share identical latency profiles,
> matching the buffered-vs-buffered pattern.
>
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.

I like this; this is the better way to kick off the writeback.

Reviewed-by: Jens Axboe <axboe@xxxxxxxxx>

--
Jens Axboe