Re: [PATCH 1/4] mm: fix IOCB_DONTCACHE write performance with rate-limited writeback
From: Jeff Layton
Date: Thu Apr 02 2026 - 08:29:05 EST
On Wed, 2026-04-01 at 22:21 -0700, Christoph Hellwig wrote:
> On Wed, Apr 01, 2026 at 03:10:58PM -0400, Jeff Layton wrote:
> > IOCB_DONTCACHE calls filemap_flush_range() with nr_to_write=LONG_MAX
> > on every write, which flushes all dirty pages in the written range.
> >
> > Under concurrent writers this creates severe serialization on the
> > writeback submission path, causing throughput to collapse to ~47% of
> > buffered I/O with multi-second tail latency. Even single-client
> > sequential writes suffer: on a 512GB file with 256GB RAM, the
> > aggressive flushing triggers dirty throttling that limits throughput
> > to 575 MB/s vs 1442 MB/s with rate-limited writeback.
>
> I'm not sure how you think the first paragraph relates to the
> second.
>
The belief is that under a heavy parallel write workload on the same
inode, the writers all end up stacking up on the mapping's xa_lock.
However, as Ritesh points out, I should probably confirm that with perf.

> > Replace the filemap_flush_range() call in generic_write_sync() with a
> > new filemap_dontcache_writeback_range() that uses two rate-limiting
> > mechanisms:
> >
> > 1. Skip-if-busy: check mapping_tagged(PAGECACHE_TAG_WRITEBACK)
> > before flushing. If writeback is already in progress on the
> > mapping, skip the flush entirely. This eliminates writeback
> > submission contention between concurrent writers.
>
> Makes sense.
>
> > 2. Proportional cap: when flushing does occur, cap nr_to_write to
> > the number of pages just written. This prevents any single
> > write from triggering a large flush that would starve concurrent
> > readers.
>
> This doesn't make any sense at all.
> filemap_flush_range/filemap_writeback always caps the number of written
> pages to the range passed in. What do you think is the change here?
>
I had some earlier results that indicated that this did help. It's
possible they were bogus though. I'll recheck that and get back to you.
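
For reference, here is a rough user-space model of what the two
mechanisms are meant to do. It is only a sketch: struct toy_mapping,
toy_dontcache_writeback() and TOY_PAGE_SIZE are made-up names, not
kernel APIs, and the boolean merely stands in for the
mapping_tagged(PAGECACHE_TAG_WRITEBACK) check. It says nothing about
whether the cap differs from what the range already implies, which is
the open question above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages, purely for illustration. */
#define TOY_PAGE_SIZE 4096L

/* Hypothetical stand-in for an address_space; not the kernel structure. */
struct toy_mapping {
	bool writeback_in_progress;	/* models the PAGECACHE_TAG_WRITEBACK check */
	long dirty_pages;		/* dirty pages in the written range */
};

/*
 * Model of the proposed policy. Returns the number of pages submitted
 * for writeback, or 0 if the flush was skipped.
 */
static long toy_dontcache_writeback(struct toy_mapping *m, long bytes_written)
{
	long nr_to_write;

	assert(bytes_written >= 0);

	/* 1. Skip-if-busy: someone else is already writing this mapping back. */
	if (m->writeback_in_progress)
		return 0;

	/* 2. Proportional cap: at most the number of pages just written. */
	nr_to_write = (bytes_written + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	if (nr_to_write > m->dirty_pages)
		nr_to_write = m->dirty_pages;

	m->dirty_pages -= nr_to_write;
	m->writeback_in_progress = nr_to_write > 0;
	return nr_to_write;
}
```

In this model a second caller that arrives while writeback is flagged
returns 0 without touching the dirty count, which is the skip-if-busy
behaviour described in the patch.
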
> > + return filemap_writeback(mapping, start, end, WB_SYNC_NONE, &nr,
> > + WB_REASON_BACKGROUND);
>
> filemap_writeback only has 5 arguments in any tree I've looked at
> including linux-next.
>
I think this was a bad merge on my part. Mea culpa. The version in the
"dontcache" branch of my tree should be correct.

Thanks for the review!
--
Jeff Layton <jlayton@xxxxxxxxxx>