Re: [PATCH 2/4] mm: add atomic flush guard for IOCB_DONTCACHE writeback
From: Christoph Hellwig
Date: Mon Apr 06 2026 - 01:49:44 EST
On Thu, Apr 02, 2026 at 08:49:45AM -0400, Jeff Layton wrote:
> > Have you considered not doing the in-caller writeback for
> > IOCB_DONTCACHE and just leaving it to the writeback daemon?
> >
> > Either by totally disabling the writeback and just leaving the
> > dropbehind bit, or by queuing up wb_writeback_work instances for
> > the ranges, or by just increasing the pressure for the writeback
> > daemon. Note that with all schemes, including the one in this
> > patch, we might eventually run into writeback scalability limits,
> > which will require multiple writeback workers.
>
> I did test a "dropbehind" mode that just set the dropbehind bit without
> doing the flush at the end of the write. It was better than stock
> dontcache but the tail latencies were still pretty bad.
>
> I think having each writer do some writeback submission work makes a
> lot of sense. It helps keep the dirty pages below the dirty thresholds
> and doesn't seem to tax each writing task _too_ much. The trick is
> avoiding lock contention while doing it.
Well, any time you hit a shared resource from multiple threads you
create that lock contention. Which is why in file system and writeback
land we've moved away from random user processes driving both data and
metadata writeback (e.g. the XFS AIL), as it leads to exactly these
scalability issues. At some point a single writeback thread might run
out of steam, although so far when that happens it's mostly because it
is doing something stupid (e.g. writeback on file systems doing complex
allocator work).
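(For anyone following along: the in-caller flush in question is the
IOCB_DONTCACHE branch of generic_write_sync(), quoted below from
current mainline. Jeff's dropbehind-only experiment amounts to
deleting that kick and relying on PG_dropbehind, set via FGP_DONTCACHE
at write time, to free the folios once regular writeback cleans them.)

	/* include/linux/fs.h: tail of generic_write_sync() */
	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
		struct address_space *mapping = iocb->ki_filp->f_mapping;

		/* the per-writer flush a dropbehind-only mode would drop */
		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
					      iocb->ki_pos - 1);
	}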
> I think what would be ideal would be to have some (lockless) mechanism
> to say "there is enough data touched by the range just written to kick
> off a write that's a suitable size for the backing store". Each writer
> could check that and then kick off writeback for an appropriate range.
And that is called the writeback thread. So what we should do there
is to make sure we queue up writeback on it for each dontcache write.
Initially queuing up a wb_writeback_work for each range might be a
first approximation, although we should probably find a way to just
increase a threshold if we go down that road.
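To make that a little more concrete, here is a rough sketch of the
per-range queueing. Caveats: wb_writeback_work and wb_queue_work() are
private to fs/fs-writeback.c today, the struct has no range fields,
and there is no dontcache wb_reason, so range_start/range_end, the
reason and the helper itself are hypothetical:

	static void queue_dontcache_writeback(struct bdi_writeback *wb,
					      struct inode *inode,
					      loff_t start, loff_t end)
	{
		struct wb_writeback_work *work;

		work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
		if (!work) {
			/* no memory: just wake the flusher to catch up */
			wb_start_background_writeback(wb);
			return;
		}

		work->sb = inode->i_sb;
		work->nr_pages = DIV_ROUND_UP(end - start + 1, PAGE_SIZE);
		work->sync_mode = WB_SYNC_NONE;
		work->range_start = start;		/* hypothetical field */
		work->range_end = end;			/* hypothetical field */
		work->reason = WB_REASON_DONTCACHE;	/* hypothetical reason */
		work->auto_free = 1;	/* freed once the work is serviced */

		wb_queue_work(wb, work);
	}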
> I think this could even be beneficial in the normal buffered write
> codepath too.
Yes, we've had lots of observations that the current 30s timeout is
actively harmful. Especially on SSDs, but even on HDDs just keeping
writeback active might make sense.
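(For reference, the knobs behind that 30s behaviour live in
mm/page-writeback.c and are exposed as
/proc/sys/vm/dirty_writeback_centisecs and
/proc/sys/vm/dirty_expire_centisecs:

	unsigned int dirty_writeback_interval = 5 * 100;  /* flusher wakeup */
	unsigned int dirty_expire_interval = 30 * 100;    /* the 30s expiry */

Both are in centiseconds.)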