Re: [PATCH 1/4] mm: fix IOCB_DONTCACHE write performance with rate-limited writeback

From: IBM

Date: Thu Apr 02 2026 - 09:00:07 EST


Jeff Layton <jlayton@xxxxxxxxxx> writes:

> On Thu, 2026-04-02 at 10:13 +0530, Ritesh Harjani wrote:
>> Jeff Layton <jlayton@xxxxxxxxxx> writes:
>>
>> > IOCB_DONTCACHE calls filemap_flush_range() with nr_to_write=LONG_MAX
>> > on every write, which flushes all dirty pages in the written range.
>> > Under concurrent writers this creates severe serialization on the
>> > writeback submission path, causing throughput to collapse to ~47% of
>> > buffered I/O with multi-second tail latency.
>>
>> Yes, between concurrent writers, I agree with the theory.
>>
>>
>> > Even single-client
>> > sequential writes suffer: on a 512GB file with 256GB RAM, the
>> > aggressive flushing triggers dirty throttling that limits throughput
>> > to 575 MB/s vs 1442 MB/s with rate-limited writeback.
>>
>> I am not sure if this 2.5x performance penalty in a "single" sequential

Sorry, my bad. I misunderstood this 2.5x delta at first.

So in the single sequential write case, what this patch mainly improves
is unpatched RWF_DONTCACHE (1179 MB/s) to patched RWF_DONTCACHE
(1453 MB/s), i.e. a ~23% improvement.

So the theory I was talking about below was from this delta's
perspective, i.e. comparing unpatched vs. patched RWF_DONTCACHE mode.

>> writer is due to throttling logic. On giving it some thought, I suspect
>> this is because the submission side and the completion side both take
>> the xa_lock and hence could be contending on it.
>>
>> For e.g., since this patch skips doing the flush the second time (note
>> that writeback is already active when the same writer dirtied the page
>> during the previous write), the writer can spend more time writing data
>> into page cache pages, instead of waiting on the xa_lock that the
>> completion callback could be holding (folio_end_writeback() -> folio_end_dropbehind()).
>>
>> Looking at the Peak Dirty data from the link you shared [1] for the single-writer case...
>>
>> Mode                   MB/s  p50 (ms)  p99 (ms)  p99.9 (ms)  Peak Dirty  Peak Cache
>> dontcache (unpatched)  1179  3.2       103.3     170.9       14 MB       4.7 GB
>> dontcache (patched)    1453  5.4       43.8      57.4        36 GB       45 GB
>>
>> ... this too shows that the submission side is dirtying pages faster
>> than the completion side is able to write them back...
>>
>> I suspect this contention (between submission and completion) could be
>> higher in the IOCB_DONTCACHE case, since the completion side also
>> removes the folio from the page cache under the same xa_lock, which is
>> not the case with normal buffered writes.
>>
>> Maybe a perf callgraph showing the contention would be a nice thing to
>> add here [1] ;).
>>
>> [1]: https://markdownpastebin.com/?id=96249deb897a401ba32acbce05312dcc
>>
>
> That's an interesting point.
>
> The theory I've been operating on is that the flusher thread ends up
> squatting on the xa_lock for a while when memory gets tight, and that
> blocks other readers and writers. Staying ahead of the dirty limits and
> limiting the amount of flush work that each writer does alleviates
> contention for that lock and that's what improves the performance.
>

That's right for the comparison of buffered writes against RWF_DONTCACHE.
But what I meant above was that the improvement from 1179 MB/s to 1453
MB/s could be attributed to less contention on the xa_lock in the patched
vs. the unpatched version, for the single sequential writer testcase.

> You're right though. I'll plan to play around with perf and see if I
> can confirm the theory.
>

Yes, thanks, that will be nice to have!
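
In case it helps, a rough sketch of how I'd try to capture the suspected
xa_lock contention while the writer is running (the exact perf invocation
below is an assumption on my part; it needs a reasonably recent perf with
BPF lock-contention support, root or CAP_PERFMON, and the duration/events
adjusted for your setup):

```shell
# Rough sketch: run this while the RWF_DONTCACHE write job is active.
# Flags assume a recent perf; adjust as needed for your kernel/tooling.
if command -v perf >/dev/null 2>&1; then
    # System-wide lock contention profile for 10s; look for xa_lock
    # waiters under folio_end_writeback()/writeback submission paths.
    perf lock contention -a -b -- sleep 10 \
        || echo "perf lock contention unavailable (permissions/kernel support?)"
else
    echo "perf not installed"
fi
```

A plain `perf record -g -a` during the run, followed by `perf report`,
should also show where the writers and the completion path are stalling.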

-ritesh