Decoupling filesystems from pages

From: David Howells
Date: Sun Sep 12 2021 - 09:21:21 EST


Hi Johannes,

> Wouldn't it make more sense to decouple filesystems from "paginess",
> as David puts it, now instead? Avoid the risk of doing it twice, avoid
> the more questionable churn inside mm code, avoid the confusing
> proximity to the page and its API in the long-term...

Let me seize that opening. I've been working on doing this for network
filesystems - at least those that want to buy in. The patches were posted
here:

https://lore.kernel.org/ceph-devel/162687506932.276387.14456718890524355509.stgit@xxxxxxxxxxxxxxxxxxxxxx/T/#m23428c315a77d8c5206b9646bf74c8ef18d4d38c

and the current state of the work is here:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-folio-regions

I've been looking at abstracting anything to do with pages out of the netfs
and putting that stuff into a helper library. The library handles all the
caching stuff and just presents the filesystem with requests to read
into/write from an iov_iter. The filesystem doesn't then see pages at all.

The motivation behind this is to make content encryption and compression
transparent and automatically available to all participating filesystems -
with the requirement that the data stored in the local disk cache
(i.e. fscache) is *also* encrypted.

I have content encryption working for basic read and write on afs and Jeff
Layton is looking at how to make it work with ceph - but it's very much a work
in progress and things like truncate and mmap don't yet work with it.

Anyway, the library, as I'm currently writing it, maintains a list of
byte-range dirty regions on each inode, where a dirty region may span multiple
folios and a folio may contribute to multiple regions. The fact that
pages are involved is really then merely an implementation detail.
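
To make the region/folio relationship concrete, here is a minimal userspace
sketch. It is not the code from the branch above: struct dirty_region, the
helper names and the fixed FOLIO_SIZE are all assumptions made for
illustration (real folios vary in size). It shows one region covering several
folios and one folio being touched by several regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a byte-range dirty region on an inode, covering [start, end). */
struct dirty_region {
	uint64_t start;			/* first dirty byte */
	uint64_t end;			/* one past the last dirty byte */
	struct dirty_region *next;	/* next region on the same inode */
};

/* Assumed fixed folio size, purely for illustration. */
#define FOLIO_SIZE 4096ULL

/* Number of folios that a single region touches. */
static uint64_t region_folio_span(const struct dirty_region *r)
{
	assert(r->end > r->start);
	uint64_t first = r->start / FOLIO_SIZE;
	uint64_t last = (r->end - 1) / FOLIO_SIZE;

	return last - first + 1;
}

/* Number of regions on the list that overlap the folio at folio_index. */
static unsigned int folio_region_count(const struct dirty_region *head,
				       uint64_t folio_index)
{
	uint64_t fstart = folio_index * FOLIO_SIZE;
	uint64_t fend = fstart + FOLIO_SIZE;
	unsigned int n = 0;

	for (const struct dirty_region *r = head; r; r = r->next)
		if (r->start < fend && r->end > fstart)
			n++;
	return n;
}
```

With regions [100, 4200) and [5000, 13000), the first spans two folios and
both overlap folio 1, so nothing in the bookkeeping is tied to a single page.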

Content encryption/compression blocks may be any power-of-2 size, from 2 bytes
to megabytes, and this need bear no relation to page size. The library calls
the crypto hooks for each crypto block in the chunk[*] to be crypted.
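
As a rough sketch of the per-block iteration (again, not the library's actual
interface: crypt_block_fn, crypt_chunk() and count_blocks() are hypothetical
names, and rounding the chunk out to block boundaries is my assumption here),
walking a chunk in power-of-2 crypto blocks might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type: called once for each crypto block in the chunk. */
typedef int (*crypt_block_fn)(uint64_t block_start, size_t block_size,
			      void *ctx);

/*
 * Call hook for every crypto block overlapping [start, start + len).
 * bsize must be a power of 2 and at least 2; it need not relate to the
 * page size.  Stops at the first hook that returns a negative value.
 */
static int crypt_chunk(uint64_t start, uint64_t len, size_t bsize,
		       crypt_block_fn hook, void *ctx)
{
	assert(bsize >= 2 && (bsize & (bsize - 1)) == 0);
	if (len == 0)
		return 0;

	uint64_t mask = (uint64_t)bsize - 1;
	uint64_t pos = start & ~mask;			/* round down */
	uint64_t end = (start + len + mask) & ~mask;	/* round up */

	for (; pos < end; pos += bsize) {
		int ret = hook(pos, bsize, ctx);

		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Example hook that only counts how many blocks it was asked to process. */
static int count_blocks(uint64_t block_start, size_t block_size, void *ctx)
{
	(void)block_start;
	(void)block_size;
	++*(unsigned int *)ctx;
	return 0;
}
```

For instance, a 1000-byte chunk starting at offset 100 with 256-byte crypto
blocks results in five hook calls, for the blocks at 0, 256, 512, 768 and 1024.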

[*] Terminology is such fun. I have to deal with pages, crypto blocks, object
layout blocks, I/O blocks (rsize/wsize settings), regions.

In fact ->readpage(), ->writepage() and ->launder_page() are difficult when I
may be required to deal with blocks larger than the size of a page. The page
being poked may be in the middle of a block, so I'm endeavouring to work
around that. Using the regions should allow me to 'launder' an inode before
invalidating the pages attached to it, and the dirty region objects can act
instead of the dirty, writeback and fscache flags on a page.
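
The "page in the middle of a block" problem can be illustrated with a small
sketch. The struct byte_range type and the enclosing_block() helper are
hypothetical, and the page and block sizes are simply passed in as parameters;
the point is only that a per-page operation has to be widened to the whole
enclosing block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: a half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Given the page being poked, return the byte range of the block that
 * contains the start of that page.  bsize must be a power of 2 and may be
 * larger than page_size, in which case the range covers several pages.
 */
static struct byte_range enclosing_block(uint64_t page_index,
					 uint64_t page_size, uint64_t bsize)
{
	assert(bsize != 0 && (bsize & (bsize - 1)) == 0);

	uint64_t pos = page_index * page_size;
	struct byte_range r;

	r.start = pos & ~(bsize - 1);
	r.end = r.start + bsize;
	return r;
}
```

With 4096-byte pages and 16384-byte blocks, poking page 5 yields the range
[16384, 32768), i.e. pages 4 to 7 all have to be handled together.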

I've been building this on top of Willy's folio patchset, and so I've paused
for the moment whilst I wait to see what becomes of that. If folios don't
get in or get renamed, I have a load of reworking to do.

Does this sound like something you'd be interested in looking at more
generally than just network filesystems?

David