Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
From: Al Viro
Date: Fri Sep 23 2022 - 12:14:10 EST
On Fri, Sep 23, 2022 at 01:44:33AM -0700, Christoph Hellwig wrote:
> Why would I? We generall do have or should have the iov_iter around.
Not for async IO.
> And for the common case where we don't (bios) we can carry that
> information in the bio as it needs a special unmap helper anyway.
*Any* async read_iter is like that.
> > Where are they getting
> > dropped and what guarantees that IO is complete by that point?
>
> The exact place depens on the exact taaraget frontend of which we have
> a few. But it happens from the end_io callback that is triggered
> through a call to target_complete_cmd.
OK...
> > The reason I'm asking is that here you have an ITER_BVEC possibly fed to
> > __blkdev_direct_IO_async(), with its
> > if (iov_iter_is_bvec(iter)) {
> > /*
> > * Users don't rely on the iterator being in any particular
> > * state for async I/O returning -EIOCBQUEUED, hence we can
> > * avoid expensive iov_iter_advance(). Bypass
> > * bio_iov_iter_get_pages() and set the bvec directly.
> > */
> > bio_iov_bvec_set(bio, iter);
> > which does *not* grab the page referneces. Sure, bio_release_pages() knows
> > to leave those alone and doesn't drop anything. However, what is the
> > mechanism preventing the pages getting freed before the IO completion
> > in this case?
>
> The contract that callers of bvec iters need to hold their own
> references as without that doing I/O do them would be unsafe. It they
> did not hold references the pages could go away before even calling
> bio_iov_iter_get_pages (or this open coded bio_iov_bvec_set).
You are mixing two issues here - holding references to pages while using
iov_iter instance is obvious; holding them until async IO is complete, even
though struct iov_iter might be long gone by that point is a different
story.
And originating iov_iter instance really can be long-gone by the time
of IO completion - requirement to keep it around would be very hard to
satisfy. I've no objections to requiring the pages in ITER_BVEC to be
preserved at least until the IO completion by means independent of
whatever ->read_iter/->write_iter does to them, but
* that needs to be spelled out very clearly and
* we need to verify that it is, indeed, the case for all existing
iov_iter_bvec callers, preferably with comments next to non-obvious ones
(something that is followed only by the sync IO is obvious)
That goes not just for bio - if we make get_pages *NOT* grab references
on ITER_BVEC (and I'm all for it), we need to make sure that those
pages won't be retained after the original protection runs out. Which
includes the reference put into struct nfs_request, for example, as well
as whatever ceph transport is doing, etc. Another thing that needs to
be sorted out is __zerocopy_sg_from_iter() and its callers - AFAICS,
all of those are in ->sendmsg() with MSG_ZEROCOPY in flags.
It's a non-trivial amount of code audit - we have about 40 iov_iter_bvec()
callers in the tree, and while many are clearly sync-only... the ones
that are not tend to balloon into interesting analysis of call chains, etc.
Don't get me wrong - that analysis needs to be done, just don't expect
it to be trivial. And it will require quite a bit of cooperation from the
folks familiar with e.g. drivers/target, or with ceph transport layer,
etc.
FWIW, my preference would be along the lines of
Backing memory for any non user-backed iov_iter should be protected
from reuse by creator of iov_iter and that protection should continue
through not just all synchronous operations with iov_iter in question
- it should last until all async operations involving that memory are
finished. That continued protection must not depend upon taking
extra page references, etc. while we are working with iov_iter.
But getting there will take quite a bit of code audit and probably some massage
as well.