Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE
From: Jeff Layton
Date: Wed Apr 08 2026 - 14:47:12 EST
On Wed, 2026-04-08 at 10:25 -0400, Jeff Layton wrote:
> This version adopts Christoph's suggestion to have generic_write_sync()
> kick the flusher thread for the superblock instead of initiating
> writeback directly. This seems to perform as well or better in most
> cases than doing the writeback directly.
>
> Here are results on XFS, both local and exported via knfsd:
>
> nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
> xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475
>
> Ritesh had also asked about getting perf lock traces to confirm the
> source of the contention. I did that (and I can post them if you like),
> but the results from the unpatched dontcache runs didn't point out any
> specific lock contention. That leads me to believe that the bottlenecks
> were from normal queueing work, and not contention for the xa_lock after
> all.
>
> Kicking the writeback thread seems to be a clear improvement over the
> status quo in my testing, but I do wonder if having dontcache writes
> spamming writeback for the whole bdi is the best idea.
>
> I'm benchmarking out a patch that has the flusher do a
> writeback_single_inode() for the work. I don't expect it to perform
> measurably better in this testing, but it would better isolate the
> DONTCACHE writeback behavior to just those inodes touched by DONTCACHE
> writes.
>
> Assuming that looks OK, I'll probably send a v3. Original cover letter
> from v1 follows:
>
Actually, that version regressed performance in a couple of cases. I think v2 is probably the best approach, on balance. Maybe we can get this into -next so that it can make v7.2?
Here's the comparison between this version and a writeback_single_inode() flush version:
------------------8<-----------------------
● Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):
Per-Inode vs Whole-BDI Flusher — DONTCACHE on Local XFS
Single-Client Writes
┌──────────────────┬───────────┬───────────┬─────────────┐
│ Benchmark │ Whole-BDI │ Per-Inode │ Change │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write MB/s │ 1450 │ 1438 │ -1% (noise) │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write p99.9 │ 23.5 ms │ 23.5 ms │ identical │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write MB/s │ 363 │ 286 │ -21% │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ ~9x (worse) │
└──────────────────┴───────────┴───────────┴─────────────┘
Seq write is within noise. Rand write regressed: the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a single blk_plug, while per-inode write_inode_now() loses that batching.
Single-Client Reads
┌────────────────┬───────────┬───────────┬────────┐
│ Benchmark │ Whole-BDI │ Per-Inode │ Change │
├────────────────┼───────────┼───────────┼────────┤
│ Seq read MB/s │ 2950 │ 2350 │ -20% │
├────────────────┼───────────┼───────────┼────────┤
│ Rand read MB/s │ 651 │ 519 │ -20% │
└────────────────┴───────────┴───────────┴────────┘
Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions between runs rather than a per-inode regression.
Multi-Writer (Scenario A)
┌────────────────┬───────────┬───────────┬────────────┐
│ Metric │ Whole-BDI │ Per-Inode │ Change │
├────────────────┼───────────┼───────────┼────────────┤
│ Aggregate MB/s │ 1478 │ 999 │ -32% │
├────────────────┼───────────┼───────────┼────────────┤
│ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
└────────────────┴───────────┴───────────┴────────────┘
This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.
Scenario C & D (Noisy Neighbor)
┌─────────────────────────┬───────────┬───────────┬─────────────┐
│ Metric │ Whole-BDI │ Per-Inode │ Change │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C writer │ 1468 │ 1386 │ -6% │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C readers │ 18.7 MB/s │ 18.7 MB/s │ identical │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D writer │ 1472 │ 1467 │ identical │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D readers │ 496 MB/s │ 507 MB/s │ +2% │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D reader p99.9 │ 440 us │ 358 us │ +19% better │
└─────────────────────────┴───────────┴───────────┴─────────────┘
Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less device contention for buffered readers.
Summary
The per-inode approach is neutral-to-slightly-better for the production scenario (Scenario D), but regresses on multi-writer and random write workloads. The core issue is the loss of I/O batching: writeback_sb_inodes() processes all dirty inodes in one blk_plug'd pass, while per-inode write_inode_now() calls are processed one at a time. The read regressions likely reflect different system conditions, since buffered/direct reads also dropped ~20%.
--
Jeff Layton <jlayton@xxxxxxxxxx>