How to convert I/O iterators to iterators, sglists and RDMA lists
From: David Howells
Date: Fri Oct 14 2022 - 11:27:30 EST
Hi Christoph, Al,
One of the aims I have for netfslib is to hide the involvement of pages/folios
entirely from the filesystem. That way the filesystem need not concern itself
with changes such as multipage folios appearing in the VM.
To this end, I'm trying to make it such that each netfs_io_subrequest contains
an iterator that describes the segment of buffer that a subrequest is dealing
with. The filesystem interprets the buffer appropriately, and can even pass
the iterator directly to kernel_sendmsg() or kernel_recvmsg() if this is
convenient.
In netfslib and in the network filesystems using it, however, there are a
number of situations where we need to "convert" an iterator:
(1) Async direct I/O.
In the async direct I/O case, we cannot hold on to the iterator when we
return, even if the operation is still in progress (ie. we return
-EIOCBQUEUED), as it is likely to be on the caller's stack.
Also, simply copying the iterator isn't sufficient as virtual userspace
addresses cannot be trusted and we may have to pin the pages that
comprise the buffer.
(2) Crypto.
The crypto interface takes scatterlists, not iterators, so we need to be
able to convert an iterator into a scatterlist in order to do content
encryption within netfslib. Doing this in netfslib makes it easier to
store content-encrypted files encrypted in fscache.
(3) RDMA.
To perform RDMA, a buffer list needs to be presented as a QPE array.
Currently, cifs converts the iterator it is given to lists of pages, then
each list to a scatterlist and thence to a QPE array. I have code to
pass the iterator down to the bottom, using an intermediate BVEC iterator
instead of a page list if I can't pass down the original directly (eg. an
XARRAY iterator on the pagecache), but I still end up converting it to a
scatterlist, which is then converted to a QPE. I'm trying to go directly
from an iterator to a QPE array, thus avoiding the need to allocate an
sglist.
Constraints:
(A) Userspace gives us a list (IOVEC/UBUF) of buffers that may not be page
aligned and may not be contiguous; further, within a particular buffer
span, the pages may not be contiguous and may be part of multipage
folios.
Converting to a BVEC iterator allows a whole buffer to be described, and
extracting a subset of a BVEC iterator is straightforward.
(B) Kernel buffers may not be pinnable. If we get a KVEC iterator, say, we
can't assume that we can pin the pages (the buffer may be part of the
kernel rodata or belong to a device such as a flash).
This may also apply to mmap'd devices in userspace iovecs.
(C) We don't want to pin pages if we can avoid it.
(D) PIPE iterators.
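To illustrate constraint (A): a userspace span that is not page aligned has
to be broken up at page boundaries before it can be described as a list of
per-page entries. The following is a minimal userspace sketch of that
arithmetic only. It is not kernel code: struct seg, split_into_pages() and
SKETCH_PAGE_SIZE are hypothetical names invented for the illustration, a
4096-byte page is assumed, and nothing is pinned or looked up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for one {page,offset,len} entry. */
struct seg {
	uintptr_t page_index;	/* address divided by the page size */
	size_t    offset;	/* byte offset within that page */
	size_t    len;		/* bytes covered within that page */
};

/*
 * Split the span [addr, addr + len) into per-page segments.
 * Returns the number of segments written, or -1 if 'max' is too small.
 */
int split_into_pages(uintptr_t addr, size_t len, struct seg *out, int max)
{
	int n = 0;

	assert(max >= 0);
	while (len > 0) {
		size_t off = addr % SKETCH_PAGE_SIZE;
		size_t part = SKETCH_PAGE_SIZE - off;	/* room left in page */

		if (part > len)
			part = len;
		if (n == max)
			return -1;
		out[n].page_index = addr / SKETCH_PAGE_SIZE;
		out[n].offset = off;
		out[n].len = part;
		n++;
		addr += part;
		len -= part;
	}
	return n;
}
```

For a 5000-byte span starting 100 bytes into page 2, this yields two
segments: {2, 100, 3996} and {3, 0, 1004}. A real implementation would also
have to cope with multipage folios and with pages that cannot be pinned,
neither of which is modelled here.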
So, my first attempt at dealing with (1) involved creating a function that
extracted part of an iterator into another iterator[1]. Just copying and
shaping if possible (assuming, say, that an XARRAY iterator doesn't need to
pin the pages), but otherwise using repeated application of
iov_iter_get_pages() to build up a BVEC iterator (which is basically just a
list of {page,offset,len} tuples).
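The "copying and shaping" of a list of {page,offset,len} tuples can be
sketched in userspace as below. This is only a model of the idea, assuming
the source is already such a list: struct model_bvec and model_bvec_subset()
are hypothetical names, not the kernel's struct bio_vec or any existing
helper, and the page is treated as an opaque pointer that is never
dereferenced or pinned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one {page,offset,len} tuple. */
struct model_bvec {
	void   *page;	/* opaque; never dereferenced in this sketch */
	size_t  offset;
	size_t  len;
};

/*
 * Copy the tuples covering bytes [start, start + count) of 'src' into
 * 'dst', trimming the first and last tuple as needed.  Only descriptors
 * are copied.  Returns the number of tuples written, or -1 if 'dst' is too
 * small or the requested range runs past the end of 'src'.
 */
int model_bvec_subset(const struct model_bvec *src, int nr_src,
		      size_t start, size_t count,
		      struct model_bvec *dst, int max_dst)
{
	int n = 0;

	assert(nr_src >= 0 && max_dst >= 0);
	for (int i = 0; i < nr_src && count > 0; i++) {
		size_t part;

		if (start >= src[i].len) {
			/* This tuple lies wholly before the range. */
			start -= src[i].len;
			continue;
		}
		part = src[i].len - start;
		if (part > count)
			part = count;
		if (n == max_dst)
			return -1;
		dst[n].page = src[i].page;
		dst[n].offset = src[i].offset + start;
		dst[n].len = part;
		n++;
		start = 0;	/* later tuples are taken from their start */
		count -= part;
	}
	return count ? -1 : n;
}
```

The sketch deliberately sidesteps the objections described next: it says
nothing about whether the pages referred to may be pinned, nor about PIPE
iterators.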
Al objected on the basis that it was pinning pages that it didn't need to (say
extracting BVEC->BVEC) and that it didn't deal correctly with PIPE (because
the underlying pipe would get advanced too early) or KVEC/BVEC (because it
might refer to a page that was un-get_pages-able).
Christoph objected that it shouldn't be available as a general purpose helper
and that it should be kept inside cifs - but I also want to use it inside
netfslib.
My first attempt at dealing with (2) involved creating a function to scan an
iterator[2] and call a function on each segment of it. This could be used to
perform checksumming or to build up a scatterlist. However, as Al pointed
out, I didn't get the IOBUF or KVEC handling right. Mostly, though, I want to
convert to an sglist and work from that.
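The scan-and-call approach can be illustrated with a small userspace model.
It assumes the simplest case only, a list of directly addressable
{base,len} segments; struct model_kvec, model_scan() and sum_bytes() are
hypothetical names made up for this sketch, and the checksum is a plain
byte sum standing in for whatever the per-segment function would really do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a directly addressable segment. */
struct model_kvec {
	const void *base;
	size_t      len;
};

/* Per-segment callback; a non-zero return stops the walk. */
typedef int (*seg_fn)(const void *data, size_t len, void *priv);

/*
 * Call 'fn' on each segment covering the first 'count' bytes of the list.
 * Returns the number of bytes successfully visited.
 */
size_t model_scan(const struct model_kvec *v, int nr, size_t count,
		  seg_fn fn, void *priv)
{
	size_t done = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr && done < count; i++) {
		size_t part = v[i].len;

		if (part > count - done)
			part = count - done;	/* trim the final segment */
		if (part == 0)
			continue;
		if (fn(v[i].base, part, priv))
			break;
		done += part;
	}
	return done;
}

/* Example callback: additive byte sum, not a real network checksum. */
int sum_bytes(const void *data, size_t len, void *priv)
{
	const uint8_t *p = data;
	uint32_t *sum = priv;

	for (size_t i = 0; i < len; i++)
		*sum += p[i];
	return 0;
}
```

The same walker could take a callback that appends each segment to an
output list instead of summing it, which is the shape of the scatterlist
case. Page-backed and userspace-backed segments would need mapping or
pinning inside the walk, which this model leaves out.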
I then had a go at implementing a common framework[3] to extract an iterator
into another iterator, an sglist, an RDMA QPE array or any other type of list
we might envision. Al's not keen on that for a number of reasons (see his
reply) including that it loses type safety and that I should be using
iov_iter_get_pages2() - which he already objected to me doing in [1] :-/
So any thoughts on what the right way to do this is? What is the right API?
I have three things I need to make from a source iterator: a copy and/or a
subset iterator, a scatterlist and an RDMA QPE array, and several different
types of iterator to extract from. I shouldn't pin pages unless I need to,
sometimes pages cannot be pinned and sometimes I may have to add the physical
address to the entry.
If I can share part of the infrastructure, that would seem to be a good thing.
David
[1] https://lore.kernel.org/r/165364824259.3334034.5837838050291740324.stgit@xxxxxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/r/165364824973.3334034.10715738699511650662.stgit@xxxxxxxxxxxxxxxxxxxxxx/
[3] https://lore.kernel.org/r/3750754.1662765490@xxxxxxxxxxxxxxxxxxxxxx/