[no subject]
From: David Howells
Date: Wed Jul 21 2021 - 09:46:01 EST
[RFC PATCH 00/12] netfs: Experimental write helpers, fscrypt and compression
Date:
Hi all,
I've been working on extending the netfs helper library to provide write
support (even VM support) for the filesystems that want to use it, with an
eye to building in transparent content crypto support (eg. fscrypt) - so
that content-encrypted data is stored in fscache in encrypted form - and
also client-side compression (something that cifs/smb supports, I believe,
and something that afs may acquire in the future).
This brings interesting issues with PAGE_SIZE potentially being smaller
than the I/O block size, and thus having to make sure pages that aren't
locally modified stay retained. Note that whilst folios could, in theory,
help here, a folio requires contiguous RAM.
So here's the changes I have so far (WARNING: it's experimental, so may
contain debugging stuff, notes and extra bits and it's not fully
implemented yet). The changes can also be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-regions-experimental
With this, I can do simple reads and writes through afs, and the
modifications can be made encrypted to both the server and the cache
(though I haven't yet written the decryption side of it).
REGION LISTS
============
One of the things that I've been experimenting with is keeping a list of
dirty regions on each inode separate from the dirty bits on the page.
Region records can then be used to manage the spans of pages required to
construct a crypto or compression buffer.
With these records available, other possibilities become available:
(1) I can use a separate list of regions to keep track of the pending and
active writes to an inode. This allows the inode lock to be dropped
as soon as the region is added, with the record acting as a region
lock. (Truncate/fallocate is then just a special sort of direct write
operation).
(2) Keeping a list of active writes allows parallel non-overlapping[*]
write operations to the pagecache and possibly parallel DIO write
operations to the server (I think cifs to Windows allows this).
[*] Non-overlapping in the sense that, under some circumstances, they
aren't even allowed to touch the same page/folio.
(3) After downloading data from the server, we may need to write it to the
cache. This can be deferred to the VM writeback mechanism by adding a
'semi-dirty' region and marking the pages dirty.
(4) Regions can be grouped so that the groups have to be flushed in order,
thereby allowing ceph snaps and fsync to be implemented by the same
mechanism and offering a possible RWF_BARRIER for pwritev2().
(5) No need for write_begin/write_end in the filesystem - this is handled
using the region in netfs_perform_write().
(6) PG_fscache is no longer required as the region record can track this.
(7) page->private isn't needed to track dirty state.
I also keep a flush list of regions that need writing. If we can't manage
to lock all the pages in a part of the region we want to change, we drop
the locks we've taken and defer. Since the following should be true:
- since a dirty region represents data in the pagecache that needs
writing, the pages containing that data must be present in RAM;
- a region in the flushing state acts as an exclusive lock against
overlapping active writers (which must wait for it);
- the ->releasepage() and ->migratepage() methods can be used to prevent
the page from being lost
it might be feasible to access the page *without* taking the page lock.
The flusher can split an active region in order to write out part of it,
provided it does so at a page boundary that's at or less than the dirtied
point. This prevents an in-progress write pinning the entirety of memory.
An alternative to using region records that I'm pondering is to pull the
NFS code for page handling into the netfs lib. I'm not sure that this
would make it easier to handle multipage spans, though, as releasepage
would need to look at the pages either side. "Regions" would also be
concocted on the fly by writepages() - but, again, this may require the
involvement of other pages so I would have to be extremely careful of
deadlock.
PROVISION OF BUFFERING
======================
Another of the things I'm experimenting with is sticking buffers in xarray
form in the read and write request structs.
On the read side, this allows a buffer larger than the requested size to be
employed, with the option to discard the excess data or splice it over into
the pagecache - for instance if we get a compressed blob that we don't know
the size of yet or that is larger than the hole we have available in the
pagecache. A second buffer can be employed to decrypt or decryption can be
done in place, depending on whether we want to copy the encrypted data to
the pagecache.
On the write side, this can be used to encrypt into, with the buffer then
being written to the cache and the server rather than the original. If
compression is involved, we might want two buffers: we might need to copy
the original into the first buffer so that it doesn't change during
compression, then compress into the second buffer (which could then be
encrypted - if that makes sense).
With regard to DIO, if crypto is required, the helpers would copy the data
in or out of separate buffers, crypting the buffers and uploading or
downloading the buffers to/from the server. I could even make it handle
RMW for smaller reads, but that needs to be careful because of the
possibility of collision with remote conflicts.
HOW NETFSLIB WOULD BE USED
==========================
In the scheme I'm experimenting with, I envision that a filesystem would
add a netfs context directly after its inode, e.g.:
struct afs_vnode {
struct {
struct inode vfs_inode;
struct netfs_i_context netfs_ctx;
};
...
};
and then point many of its inode, address space and VM methods directly at
netfslib, e.g.:
const struct file_operations afs_file_operations = {
.open = afs_open,
.release = afs_release,
.llseek = generic_file_llseek,
.read_iter = generic_file_read_iter,
.write_iter = netfs_file_write_iter,
.mmap = afs_file_mmap,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.fsync = netfs_fsync,
.lock = afs_lock,
.flock = afs_flock,
};
const struct address_space_operations afs_file_aops = {
.readpage = netfs_readpage,
.readahead = netfs_readahead,
.releasepage = netfs_releasepage,
.invalidatepage = netfs_invalidatepage,
.writepage = netfs_writepage,
.writepages = netfs_writepages,
};
static const struct vm_operations_struct afs_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = netfs_page_mkwrite,
};
though it can, of course, wrap them if it needs to. The inode context
stores any required caching cookie, crypto management parameters and an
operations table.
The netfs lib would be providing helpers for write_iter, page_mkwrite,
writepage, writepages, fsync, truncation and remote invalidation - the idea
being that the filesystem then just needs to provide hooks to perform read
and write RPC operations plus other optional hooks for the maintenance of
state and to help manage grouping, shaping and slicing I/O operations and
doing content crypto, e.g.:
const struct netfs_request_ops afs_req_ops = {
.init_rreq = afs_init_rreq,
.begin_cache_operation = afs_begin_cache_operation,
.check_write_begin = afs_check_write_begin,
.issue_op = afs_req_issue_op,
.cleanup = afs_priv_cleanup,
.init_dirty_region = afs_init_dirty_region,
.free_dirty_region = afs_free_dirty_region,
.update_i_size = afs_update_i_size,
.init_wreq = afs_init_wreq,
.add_write_streams = afs_add_write_streams,
.encrypt_block = afs_encrypt_block,
};
SERVICES THE HELPERS WOULD PROVIDE
==================================
The helpers are intended to transparently provide a number of services to
all the filesystems that want to use them:
(1) Handling of multipage folios. The helpers provide iov_iters to the
filesystem indicating the pages to be read/written. These may point
into the pagecache, may point to userspace for unencrypted DIO or may
point to a separate buffer for cryption/compression. The fs doesn't
see any pages/folios unless it wants to.
(2) Handling of content encryption (e.g. fscrypt). Encrypted data should
be encrypted in fscache. The helpers will write the downloaded
encrypted data to the cache and will write modified data to the cache
after it had been encrypted.
The filesystem will provide the actual crypto, though the helpers can
do the block-by-block iteration and setting up of scatterlists. The
intention is that if fscrypt is being used, the helper will be there.
(3) Handling of compressed data. If the data is stored in compressed
blocks on the server, whereby the client does the (de)compression
locally, support for handling that is similar to crypto.
The helpers will provide the buffers and filesystem will provide the
compression, though the filesystem can expand the buffers as needed.
(4) Handling of I/O block sizes larger than page size. If the filesystem
needs to perform a block RPC I/O that's larger than page size - say it
has to deal with full-file crypto or a large compression blocksize -
the helpers will keep around and gather together larger units to make
it possible to handle writes.
For a read of a larger block size, the helpers create a buffer of the
size required, padding it with extra pages as necessary and read into
that. The extra pages can then be spliced into holes in the pagecache
rather than being discarded.
(5) Handling of direct I/O. The helpers will break down DIO requests into
slices based on the rsize/wsize and can also do content crypto and
(de)compression on the data.
In the encrypted case, I would, initially at least, make it so that
the DIO blocksize is set to a multiple of the crypto blocksize. I
could allow it to be smaller: when reading, I can just discard the
excess, but on writing I would need to implement some sort of RMW
cycle.
(6) Handling of remote invalidation. The helpers would be able to operate
in a number of modes when local modifications exist:
- discard all local changes
- download the new version and reapply local changes
- keep local version and overwrite server version
- stash local version and replace with new version
(7) Handling of disconnected operation. Given feedback from the
filesystem to indicate when we're in disconnected operation, the
helpers would save modified code only to the cache, along with a list
of modified regions.
Upon reconnection, we would need to sync back to the server - and then
the handling of remote invalidation would apply when we hit a conflict.
THE CHANGES I'VE MADE SO FAR
============================
The attached patches make a partial attempt at the above and partially
convert the afs filesystem to use them. It is by no means complete,
however, and almost certainly contains bugs beyond the bits not yet wholly
implemented.
To this end:
(1) struct netfs_dirty_region defined a region. This is primarily used to
track which portions of an inode's pagecache are dirty and in what
manner. Not all dirty regions are equal.
(2) Contiguous dirty regions may be mergeable or one may supersede part of
another (a local modification supersedes a download), depending on
type, state and other stuff. netfs_merge_dirty_region() deals with
this.
(3) A read from the server will generate a semi-dirty region that is to be
written to the cache only. Such writes to the cache are then driven
by the VM, no longer being dispatched automatically on completion of
the read.
(4) DSYNC writes supersede ordinary writes, may not be merged and are
flushed immediately. The writer then waits for that region to finish
flushing. (Untested)
(5) Every region belongs to a flush group. This provides the opportunity
for writes to be grouped and for the groups to be flushed in order.
netfs_flush_region() will flush older regions. (Untested)
(6) The netfs_dirty_region struct is used to manage write operations on an
inode. The inode has two lists for this: pending writes and active
writes. A write request is initially put onto the pending list until
the region it wishes to modify becomes free of active writes, then
it's moved to the active list.
Writes on the active list are not allowed to overlap in their reserved
regions. This acts as a region lock, allowing the inode lock to be
dropped immediately after the record is queued.
(7) Each region has a bounding box that indicates where the start and end
of the pages involved are. The bounding box is expanded to fit
crypto, compression and cache blocksize requirements.
Incompatible writes are not allowed to share bounding boxes (e.g. DIO
writes may not overlap with other writes as the pagecache needs
invalidation thereafter).
This is extra complicated with the advent of THPs/multipage folios are
the page boundaries are variable.
It might make sense to keep track of partially invalid regions too and
require them to be downloaded before allowing them to be read.
(8) An active write is not permitted to proceed until any flushing regions
it overlaps with are complete. At that point, it is also added to the
dirty list. As it progresses, its dirty region is expanded and the
writeback manager may split off part of that to make space. Once it
is complete, it becomes an ordinary dirty region (if not DIO).
(9) When a writeback of part of a region occurs, pages in the bounding box
may be pinned as well as pages containing the modifications as
necessary to perform crypto/compression.
(10) We then have the situation where a page may be holding modifications
from different dirty regions. Under some circumstances (such as the
file being freshly created locally), these will be merged, bridging
the gaps with zeros.
However, if such regions cannot be merged, if we write out one region,
we have to be careful not to clear the dirty mark on the page if
there's another dirty region on it. Similarly, the writeback mark
might need maintaining after a region completes writing.
Note that a 'page' might actually be a multipage folio and could be
quite large - possibly multiple megabytes.
(11) writepage() is an issue. The VM might call us to ask for a page in
the middle of a dirty region be flushed. However, the page is locked
by the caller and we might need pages from either side to actually
perform the write (which might also be locked).
What I'm thinking of here is to have netfs_writepage() find the dirty
region(s) contributory to a dirty page and put them on the flush queue
and then return to the VM saying it couldn't be done at this time.
David
Proposals/information about previous parts of the design have been
published here:
Link: https://lore.kernel.org/r/24942.1573667720@xxxxxxxxxxxxxxxxxxxxxx/
Link: https://lore.kernel.org/r/2758811.1610621106@xxxxxxxxxxxxxxxxxxxxxx/
Link: https://lore.kernel.org/r/1441311.1598547738@xxxxxxxxxxxxxxxxxxxxxx/
Link: https://lore.kernel.org/r/160655.1611012999@xxxxxxxxxxxxxxxxxxxxxx/
v5 of the read helper patches was here:
Link: https://lore.kernel.org/r/161653784755.2770958.11820491619308713741.stgit@xxxxxxxxxxxxxxxxxxxxxx/
---
David Howells (12):
afs: Sort out symlink reading
netfs: Add an iov_iter to the read subreq for the network fs/cache to use
netfs: Remove netfs_read_subrequest::transferred
netfs: Use a buffer in netfs_read_request and add pages to it
netfs: Add a netfs inode context
netfs: Keep lists of pending, active, dirty and flushed regions
netfs: Initiate write request from a dirty region
netfs: Keep dirty mark for pages with more than one dirty region
netfs: Send write request to multiple destinations
netfs: Do encryption in write preparatory phase
netfs: Put a list of regions in /proc/fs/netfs/regions
netfs: Export some read-request ref functions
fs/afs/callback.c | 2 +-
fs/afs/dir.c | 2 +-
fs/afs/dynroot.c | 1 +
fs/afs/file.c | 193 ++------
fs/afs/inode.c | 25 +-
fs/afs/internal.h | 27 +-
fs/afs/super.c | 9 +-
fs/afs/write.c | 397 ++++-----------
fs/ceph/addr.c | 2 +-
fs/netfs/Makefile | 11 +-
fs/netfs/dio_helper.c | 140 ++++++
fs/netfs/internal.h | 104 ++++
fs/netfs/main.c | 104 ++++
fs/netfs/objects.c | 218 +++++++++
fs/netfs/read_helper.c | 460 ++++++++++++-----
fs/netfs/stats.c | 22 +-
fs/netfs/write_back.c | 592 ++++++++++++++++++++++
fs/netfs/write_helper.c | 924 +++++++++++++++++++++++++++++++++++
fs/netfs/write_prep.c | 160 ++++++
fs/netfs/xa_iterator.h | 116 +++++
include/linux/netfs.h | 273 ++++++++++-
include/trace/events/netfs.h | 325 +++++++++++-
22 files changed, 3488 insertions(+), 619 deletions(-)
create mode 100644 fs/netfs/dio_helper.c
create mode 100644 fs/netfs/main.c
create mode 100644 fs/netfs/objects.c
create mode 100644 fs/netfs/write_back.c
create mode 100644 fs/netfs/write_helper.c
create mode 100644 fs/netfs/write_prep.c
create mode 100644 fs/netfs/xa_iterator.h