Re: How to convert I/O iterators to iterators, sglists and RDMA lists

From: David Howells
Date: Thu Oct 20 2022 - 10:04:27 EST


Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> > (1) Async direct I/O.
> >
> > In the async direct I/O case, we cannot hold on to the iterator when we
> > return, even if the operation is still in progress (ie. we return
> > EIOCBQUEUED), as it is likely to be on the caller's stack.
> >
> > Also, simply copying the iterator isn't sufficient as virtual userspace
> > addresses cannot be trusted and we may have to pin the pages that
> > comprise the buffer.
>
> This is closely related to the discussion we are having about pinning
> for O_DIRECT with Ira and Al.

Do you have a link to that discussion? I don't see anything obvious on
fsdevel including Ira.

I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
involved in that.

> What block file systems do is to take the pages from the iter and some flags
> on what is pinned. We can generalize this to store all extra state in a
> flags word, or bite the bullet and allow cloning of the iter in one form or
> another.

Yeah, I know. A list of pages is not an ideal solution. It can only handle
contiguous runs of pages, possibly with a partial page at either end. A bvec
iterator would be of more use as it can handle a series of partial pages.
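
To illustrate the limitation, here is a minimal userspace sketch (not kernel
code; the sketch_* names and the fixed 4096-byte page size are made up for
illustration). It checks whether a series of bvec-like segments could also be
described as a plain page list with a start offset and a total length:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for a bio_vec: one piece of one page. */
struct sketch_bvec {
	unsigned int page_index;	/* stands in for a struct page pointer */
	unsigned int offset;		/* byte offset into that page */
	unsigned int len;		/* bytes used from that page */
};

/*
 * A page list can only express a partial page at either end: every segment
 * but the first must start at offset 0, and every segment but the last must
 * run to the end of its page.
 */
bool sketch_fits_page_list(const struct sketch_bvec *bv, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(bv[i].offset + bv[i].len <= SKETCH_PAGE_SIZE);
		if (i > 0 && bv[i].offset != 0)
			return false;
		if (i + 1 < n && bv[i].offset + bv[i].len != SKETCH_PAGE_SIZE)
			return false;
	}
	return true;
}
```

Anything for which this returns false needs a bvec-style description rather
than a bare page list.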

Note also that I would need to turn the pages *back* into an iterator in order
to commune with sendmsg() in the nether reaches of some network filesystems.

> > (2) Crypto.
> >
> > The crypto interface takes scatterlists, not iterators, so we need to
> > be able to convert an iterator into a scatterlist in order to do
> > content encryption within netfslib. Doing this in netfslib makes it
> > easier to store content-encrypted files encrypted in fscache.
>
> Note that the scatterlist is generally a pretty bad interface. We've
> been talking for a while to have an interface that takes a page array
> as an input and returns an array of { dma_addr, len } tuples. Thinking
> about it, taking in an iter might actually be an even better idea.

It would be nice to be able to pass an iterator to the crypto layer. I'm not
sure what the crypto people think of that.

> > (3) RDMA.
> >
> > To perform RDMA, a buffer list needs to be presented as a QPE array.
> > Currently, cifs converts the iterator it is given to lists of pages,
> > then each list to a scatterlist and thence to a QPE array. I have
> > code to pass the iterator down to the bottom, using an intermediate
> > BVEC iterator instead of a page list if I can't pass down the
> > original directly (eg. an XARRAY iterator on the pagecache), but I
> > still end up converting it to a scatterlist, which is then converted
> > to a QPE. I'm trying to go directly from an iterator to a QPE array,
> > thus avoiding the need to allocate an sglist.
>
> I'm not sure what you mean by QPE. The fundamental low-level
> interface in RDMA is the ib_sge.

Sorry, yes. ib_sge array. I think it appears as QPs on the wire.

> If you feed it to RDMA READ/WRITE requests the interface for that is the
> RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> scatterlist but to which all of the above remarks on DMA interface apply.
> For RDMA SEND the ULP has to do a dma_map_single/page to fill it, which is
> a quite horrible layering violation and should move into the driver, but
> that is going to be a massive change to the whole RDMA subsystem, so unlikely
> to happen anytime soon.

In cifs as it stands upstream, RDMA transmission converts the iterator into a
clutch of pages at the top, which is converted back into iterators
(smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
sge lists (see smbd_post_send_sgl()).

I have patches that pass an iterator (which it decants to a bvec if async) all
the way down to the bottom layer. Snippets are then converted to scatterlists
and those to sge lists. I would like to skip the scatterlist intermediate and
convert directly to sge lists.

On the other hand, if you think the RDMA API should be taking scatterlists
rather than sge lists, that would be fine. Even better if I can just pass an
iterator in directly - though neither scatterlist nor iterator has a place to
put the RDMA local_dma_lkey - though I wonder if that's actually necessary for
each sge element, or whether it could be handed through as part of the request
as a whole.
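
As a rough sketch of the direct conversion I have in mind (userspace C, not
the actual patches; sketch_seg and sketch_sge are made-up types, the latter
just mirroring the addr/length/lkey fields of struct ib_sge, and the segments
are assumed to be DMA-mapped already), with one lkey stamped into every
element:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source segment: an already DMA-mapped piece of a buffer. */
struct sketch_seg {
	uint64_t dma_addr;
	uint32_t len;
};

/* Mirrors the three fields of the kernel's struct ib_sge. */
struct sketch_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Fill up to max_sge elements straight from the segments, with no
 * intermediate scatterlist.  The same lkey goes into every element.
 * Returns the number of elements filled, or -1 if they do not fit.
 */
int sketch_segs_to_sge(const struct sketch_seg *seg, size_t nseg,
		       struct sketch_sge *sge, size_t max_sge, uint32_t lkey)
{
	size_t n = 0;

	assert(seg != NULL && sge != NULL);
	for (size_t i = 0; i < nseg; i++) {
		if (seg[i].len == 0)
			continue;	/* nothing to transmit for this piece */
		if (n == max_sge)
			return -1;	/* caller's sge array is too small */
		sge[n].addr = seg[i].dma_addr;
		sge[n].length = seg[i].len;
		sge[n].lkey = lkey;
		n++;
	}
	return (int)n;
}
```

Since the lkey is identical in every element here, it is the sort of thing
that could in principle travel with the request instead.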

> Neither case has anything to do with what should be in common iov_iter
> code, all this needs to live in the RDMA subsystem as a consumer.

That's fine in principle. However, I have some extraction code that can
convert an iterator to another iterator, an sglist or an rdma sge list, using
a common core of code to do all three.
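
The shape of that is roughly as follows. This is a simplified userspace
sketch rather than the real code, and all the sketch_* names are hypothetical:
one walker trims the source to the requested size and hands each piece to a
sink callback, and each output format is just a different sink:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_piece { uint64_t addr; uint32_t len; };

/* A sink consumes one piece; it returns 0 on success or a negative error. */
typedef int (*sketch_sink_fn)(void *ctx, uint64_t addr, uint32_t len);

/*
 * Common core: walk the source pieces, trim to `want` bytes and pass each
 * non-empty piece to the sink.  Returns bytes extracted or the sink's error.
 */
long sketch_extract(const struct sketch_piece *p, size_t n, size_t want,
		    sketch_sink_fn sink, void *ctx)
{
	size_t done = 0;

	assert(sink != NULL);
	for (size_t i = 0; i < n && done < want; i++) {
		uint32_t len = p[i].len;
		int ret;

		if (len > want - done)
			len = (uint32_t)(want - done);
		if (len == 0)
			continue;
		ret = sink(ctx, p[i].addr, len);
		if (ret < 0)
			return ret;
		done += len;
	}
	return (long)done;
}

/* Sink 1: fill an sge-like array (fields modelled on struct ib_sge). */
struct sketch_rdma_sge { uint64_t addr; uint32_t length; uint32_t lkey; };
struct sketch_sge_sink {
	struct sketch_rdma_sge *sge;
	size_t max, used;
	uint32_t lkey;
};

int sketch_sink_sge(void *ctx, uint64_t addr, uint32_t len)
{
	struct sketch_sge_sink *s = ctx;

	if (s->used == s->max)
		return -1;
	s->sge[s->used].addr = addr;
	s->sge[s->used].length = len;
	s->sge[s->used].lkey = s->lkey;
	s->used++;
	return 0;
}

/* Sink 2: fill another piece array, i.e. the "iterator to iterator" case. */
struct sketch_piece_sink { struct sketch_piece *out; size_t max, used; };

int sketch_sink_pieces(void *ctx, uint64_t addr, uint32_t len)
{
	struct sketch_piece_sink *s = ctx;

	if (s->used == s->max)
		return -1;
	s->out[s->used].addr = addr;
	s->out[s->used].len = len;
	s->used++;
	return 0;
}
```

An sglist sink would follow the same pattern, and the sinks could live in
their respective subsystems with only the walker shared.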

I can split it up if that is preferable.

Do you have code that's ready to be used? I can make immediate use of it.

David