Re: [PATCHSET v6 0/12] Uncached buffered IO

From: Darrick J. Wong
Date: Fri Dec 06 2024 - 12:37:59 EST


On Tue, Dec 03, 2024 at 08:31:36AM -0700, Jens Axboe wrote:
> Hi,
>
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent. The approach back
> then was to have private pages for IO, and then get rid of them once IO
> was done. But that then runs into all the issues that O_DIRECT has, in
> terms of synchronizing with the page cache.
>
> So here's a new approach to the same concept, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.
>
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
>
> Even on desktop type systems, a normal NVMe device can fill the entire
> page cache in seconds. On the big system I used for testing, there's a
> lot more RAM, but also a lot more devices. As can be seen in some of the
> results in the following patches, you can still fill RAM in seconds even
> when there's 1TB of it. Hence this problem isn't solely a "big
> hyperscaler system" issue, it's common across the board.
>
> Common for both reads and writes with RWF_UNCACHED is that they use the
> page cache for IO. Reads work just like a normal buffered read would,
> with the only exception being that the touched ranges will get pruned
> after data has been copied. For writes, the ranges will get writeback
> kicked off before the syscall returns, and then writeback completion
> will prune the range. Hence writes aren't synchronous, and it's easy to
> pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
> RWF_UNCACHED IO are left untouched. This means that uncached IO
> will take advantage of the page cache for uptodate data, but not leave
> anything it instantiated/created in cache.
>
> File systems need to support this. The patches add support for the
> generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
> supporting it. The last patch adds support for btrfs as well, lightly
> tested. The read side is already done by filemap, only the write side
> needs a bit of help. The amount of code here is really trivial, and the
> only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
> return -EOPNOTSUPP just in case the fs doesn't use either the generic
> paths or iomap. Adding "support" to other file systems should be
> trivial, most of the time just a one-liner adding FOP_UNCACHED to the
> fop_flags in the file_operations struct.
>
> Performance results are in patch 8 for reads and patch 10 for writes,
> with the tldr being that I see about a 65% improvement in performance
> for both, with fully predictable IO times. CPU reduction is substantial
> as well, with no kswapd activity at all for reclaim when using uncached
> IO.
>
> Using it from applications is trivial - just set RWF_UNCACHED for the
> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
> thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
> operation. And that's it.
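
For illustration, here is a minimal userspace sketch of the pwritev2(2) /
preadv2(2) usage described above. It is not taken from the patchset. The
RWF_UNCACHED value below is an assumption (the next free RWF_* bit), since
released uapi headers do not define the flag under this name, and the
helper names write_uncached() and read_uncached() are hypothetical. The
helpers retry as plain buffered IO when the kernel or file system rejects
the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for pwritev2()/preadv2() in <sys/uio.h> */
#endif
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * ASSUMPTION: the flag value is a guess at the next free RWF_* bit.
 * Check include/uapi/linux/fs.h of the kernel you build against.
 */
#ifndef RWF_UNCACHED
#define RWF_UNCACHED 0x00000080
#endif

/* Nonzero if the error means "flag not understood or not supported". */
static int uncached_unsupported(int err)
{
	return err == EOPNOTSUPP || err == EINVAL;
}

/* Hypothetical helper: uncached buffered write, plain buffered fallback. */
static ssize_t write_uncached(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);

	if (ret < 0 && uncached_unsupported(errno))
		ret = pwritev2(fd, &iov, 1, off, 0);
	return ret;
}

/* Hypothetical helper: uncached buffered read, plain buffered fallback. */
static ssize_t read_uncached(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED);

	if (ret < 0 && uncached_unsupported(errno))
		ret = preadv2(fd, &iov, 1, off, 0);
	return ret;
}
```

Unlike O_DIRECT, nothing here needs aligned buffers, offsets, or lengths,
so the helpers can be dropped in wherever pwrite(2) or pread(2) is used
today. Short reads and writes are still possible and are left to the
caller.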
>
> Patches 1..7 are just prep patches, and should have no functional
> changes at all. Patch 8 adds support for the filemap path for
> RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
> writes. In the below mentioned branch, there are then patches to
> adopt uncached reads and writes for ext4, xfs, and btrfs.
>
> Passes full xfstests and fsx overnight runs, no issues observed. That
> includes the vm running the testing also using RWF_UNCACHED on the host.
> I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
> as I'm concerned, no further work needs doing here.
>
> And git tree for the patches is here:
>
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

Oh good, I much prefer browsing git branches these days. :)

* mm/filemap: change filemap_create_folio() to take a struct kiocb
* mm/readahead: add folio allocation helper
* mm: add PG_uncached page flag
* mm/readahead: add readahead_control->uncached member
* mm/filemap: use page_cache_sync_ra() to kick off read-ahead
* mm/truncate: add folio_unmap_invalidate() helper

The mm patches look ok to me, but I think you ought to get at least an
ack from willy since they're largely pagecache changes.

* fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag

See more detailed reply in the thread.

* mm/filemap: add read support for RWF_UNCACHED

Looks cleaner now that we don't even unmap the page if it's dirty.

* mm/filemap: drop uncached pages when writeback completes
* mm/filemap: add filemap_fdatawrite_range_kick() helper
* mm/filemap: make buffered writes work with RWF_UNCACHED

See more detailed reply in the thread.

* mm: add FGP_UNCACHED folio creation flag

I appreciate that !UNCACHED callers of __filemap_get_folio now clear the
uncached bit if it's set.

Now I proceed into the rest of your branch, because I felt like it:

* ext4: add RWF_UNCACHED write support

(Dunno about the WARN_ON removals in this patch, but this is really
Ted's call anyway).

* iomap: make buffered writes work with RWF_UNCACHED

The commit message references an "iocb_uncached_write" but I don't find
any such function in the extended patchset? Otherwise this looks ready
to me. Thanks for changing it only to set uncached if we're actually
creating a folio, and not just returning one that was already in the
pagecache.

* xfs: punt uncached write completions to the completion wq

Dumb nit: please add spaces around the "|" in
"IOMAP_F_SHARED|IOMAP_F_UNCACHED" in this patch.

* xfs: flag as supporting FOP_UNCACHED

Otherwise the xfs changes look ready too.

* btrfs: add support for uncached writes
* block: support uncached IO

Not sure why the definition of bio_dirty_lock gets moved around, but in
principle this looks ok to me too.

For the whole pile of mm changes (aka patches 1-6,8-10,12),
Acked-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>

--D

>
>  include/linux/fs.h             |  21 +++++-
>  include/linux/page-flags.h     |   5 ++
>  include/linux/pagemap.h        |  14 ++++
>  include/trace/events/mmflags.h |   3 +-
>  include/uapi/linux/fs.h        |   6 +-
>  mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
>  mm/readahead.c                 |  22 +++++--
>  mm/swap.c                      |   2 +
>  mm/truncate.c                  |  35 ++++++----
>  9 files changed, 187 insertions(+), 35 deletions(-)
>
> Since v5
> - Skip invalidation in filemap_uncached_read() if the folio is dirty
>   as well, retaining the uncached setting for later cleaning to do
>   the actual invalidation.
> - Use the same trylock approach in read invalidation as the writeback
>   invalidation does.
> - Swap order of patches 10 and 11 to fix a bisection issue.
> - Split core mm changes and fs series patches. Once the generic side
>   has been approved, I'll send out the fs series separately.
> - Rebase on 6.13-rc1
>
> --
> Jens Axboe
>
>