Re: [PATCH 2/4] mm: add atomic flush guard for IOCB_DONTCACHE writeback
From: Jeff Layton
Date: Mon Apr 06 2026 - 09:39:05 EST
On Sun, 2026-04-05 at 22:49 -0700, Christoph Hellwig wrote:
> On Thu, Apr 02, 2026 at 08:49:45AM -0400, Jeff Layton wrote:
> > > Have you considered stopping to do in-caller writeback for
> > > IOCB_DONTCACHE vs just leaving it to the writeback daemon?
> > >
> > > Either by totally disabling the writeback and just leaving the
> > > dropbehind bit, or by queuing up wb_writeback_work instances for
> > > the ranges, or by just increasing the pressure for the writeback
> > > daemon. Note that with all schemes including the one in this patch
> > > we might eventually run into writeback scalability limits, which
> > > will require multiple writeback workers.
> >
> > I did test a "dropbehind" mode that just set the dropbehind bit without
> > doing the flush at the end of the write. It was better than stock
> > dontcache but the tail latencies were still pretty bad.
> >
> > I think having each writer do some writeback submission work makes a
> > lot of sense. It helps keep the dirty pages below the dirty thresholds
> > and doesn't seem to tax each writing task _too_ much. The trick is
> > avoiding lock contention while doing it.
>
> Well, any time you hit a shared resource from multiple threads you
> create lock contention. Which is why in file system and writeback
> land we've moved away from random user processes hitting both data and
> metadata (e.g. XFS AIL) writeback as it leads to these scalability
> issues. At some point we might run out of steam in a single thread,
> although so far that's mostly been because it does stupid things
> (e.g. writeback on file systems doing complex allocator stuff).
>
> > I think what would be ideal would be to have some (lockless) mechanism
> > to say "there is enough data touched by the range just written to kick
> > off a write that's a suitable size for the backing store". Each writer
> > could check that and then kick off writeback for an appropriate range.
>
> And that is called the writeback thread. So what we should do there
> is to make sure we queue up writeback on it for each dontcache write.
> Initially queuing up a wb_writeback_work for each range might be first
> approximation, although we should probably find a way to just increase
> a threshold if going down that road.
>
Ok, I like that idea. I'll give that a shot and see how it does. I'll
note that there is no way to specify an inode or range (yet) in
struct wb_writeback_work.
Do you think it's sufficient to just call something like
wakeup_flusher_threads_bdi() after every RWF_DONTCACHE write, or should
I extend wb_writeback_work to allow for doing work on a range within a
single inode?
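
To make the second option concrete, here's a rough sketch of what I
have in mind (kernel-internal, won't compile as-is). The inode/range
fields on struct wb_writeback_work and the queue_dontcache_writeback()
helper are hypothetical, as is WB_REASON_DONTCACHE; inode_to_wb(),
inode_to_bdi(), wakeup_flusher_threads_bdi() and wb_queue_work() are
the existing helpers I'd build on (wb_queue_work() is static in
fs/fs-writeback.c today, so it would need exporting or a wrapper):

```c
struct wb_writeback_work {
	/* ... existing fields ... */
	struct inode	*inode;		/* hypothetical: target inode */
	loff_t		range_start;	/* hypothetical: range to write */
	loff_t		range_end;
};

/* Hypothetical helper, called at the end of a dontcache write. */
static void queue_dontcache_writeback(struct inode *inode,
				      loff_t start, loff_t end)
{
	struct bdi_writeback *wb = inode_to_wb(inode);
	struct wb_writeback_work *work;

	work = kzalloc(sizeof(*work), GFP_NOFS);
	if (!work) {
		/* Fall back to just poking the flusher for the bdi. */
		wakeup_flusher_threads_bdi(inode_to_bdi(inode),
					   WB_REASON_DONTCACHE);
		return;
	}

	work->inode		= inode;
	work->range_start	= start;
	work->range_end		= end;
	work->sync_mode		= WB_SYNC_NONE;
	work->auto_free		= 1;
	work->reason		= WB_REASON_DONTCACHE;

	wb_queue_work(wb, work);
}
```

The kzalloc-failure path degrades to the first option (a plain
wakeup_flusher_threads_bdi() call), so the two approaches aren't
mutually exclusive.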
> > I think this even could be beneficial in the normal buffered write
> > codepath too.
>
> Yes, we've had lots of observations that the current 30s timeout is
> actively harmful. Especially on SSDs, but even on HDDs just keeping
> the device active might make sense.
--
Jeff Layton <jlayton@xxxxxxxxxx>