Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines

From: John Hubbard
Date: Thu Sep 22 2022 - 02:09:21 EST


On 9/21/22 19:22, Al Viro wrote:
> On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:
>
>>> How would that work? What protects the area where you want to avoid running
>>> into pinned pages from previously acceptable page getting pinned? If "they
>>> must have been successfully unmapped" is a part of what you are planning, we
>>> really do have a problem...
>>
>> But this is a very good question. So far the idea was that we lock the
>> page, unmap (or writeprotect) the page, and then check pincount == 0 and
>> that is a reliable method for making sure page data is stable (until we
>> unlock the page & release other locks blocking page faults and writes). But
>> once suddently ordinary page references can be used to create pins this
>> does not work anymore. Hrm.
>>
>> Just brainstorming ideas now: So we'd either need to obtain the pins early
>> when we still have the virtual address (but I guess that is often not
>> practical but should work e.g. for normal direct IO path) or we need some
>> way to "simulate" the page fault when pinning the page, just don't map it
>> into page tables in the end. This simulated page fault could be perhaps
>> avoided if rmap walk shows that the page is already mapped somewhere with
>> suitable permissions.
>
> OK. As far as I can see, the rules are along the lines of
> * creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
> That includes
> * page known to be locked by caller
> * page being privately allocated and not visible to anyone else
> * iterator being data source
> * page coming from pin_user_pages(), possibly as the result of
> iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
> * ITER_PIPE pages are always safe
> * pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
> had been created with such.
> My preference would be to have iov_iter_get_pages() and friends pin if and
> only if we have data-destination iov_iter that is user-backed. For
> data-source user-backed we only need FOLL_GET, and for all other flavours
> (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
> at all.

This rule would mostly work, as long as we can relax it in some cases, to
allow pinning of both source and dest pages, instead of just destination
pages, in some cases. In particular, bio_release_pages() has lost all
context about whether it was a read or a write request, as far as I can
tell. And bio_release_pages() is the primary place to unpin pages for
direct IO.

>
> What I'd like to have is the understanding of the places where we drop
> the references acquired by iov_iter_get_pages(). How do we decide
> whether to unpin? E.g. pipe_buffer carries a reference to page and no
> way to tell whether it's a pinned one; results of iov_iter_get_pages()
> on ITER_IOVEC *can* end up there, but thankfully only from data-source
> (== WRITE, aka. ITER_SOURCE) iov_iter. So for those we don't care.
> Then there's nfs_request; AFAICS, we do need to pin the references in
> those if they are coming from nfs_direct_read_schedule_iovec(), but
> not if they come from readpage_async_filler(). How do we deal with
> coalescence, etc.? It's been a long time since I really looked at
> that code... Christoph, could you give any comments on that one?
>
> Note, BTW, that nfs_request coming from readpage_async_filler() have
> pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
> do not, and that's where we want them pinned. Resulting page references
> end up (after quite a trip through data structures) stuffed into struct
> rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
> they get picked by xs_read_bvec() and fed to iov_iter_bvec(). In one
> case it's safe since the pages are locked; in another - since they would
> come from pin_user_pages(). The call chain at the time they are used
> has nothing to do with the originator - sunrpc is looking at the arrived
> response to READ that matches an rpc_rqst that had been created by sender
> of that request and safety is the sender's responsibility.

For NFS Direct, is there any reason it can't be as simple as this
(conceptually, that is--the implementation of iov_iter_pin_pages_alloc()
is not shown here)? Here:


diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..7dbc705bab83 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
return 0;
}

-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
- unsigned int i;
- for (i = 0; i < npages; i++)
- put_page(pages[i]);
-}
-
void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
struct nfs_direct_req *dreq)
{
@@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
size_t pgbase;
unsigned npages, i;

- result = iov_iter_get_pages_alloc2(iter, &pagevec,
+ result = iov_iter_pin_pages_alloc(iter, &pagevec,
rsize, &pgbase);
if (result < 0)
break;
@@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+
+ /*
+ * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for
+ * the user_backed_iter() case (only).
+ */
+ if (user_backed_iter(iter))
+ unpin_user_pages(pagevec, npages);
+ else
+ release_pages(pagevec, npages);
+
kvfree(pagevec);
if (result < 0)
break;
@@ -829,7 +831,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+ release_pages(pagevec, npages);
kvfree(pagevec);
if (result < 0)
break;

thanks,

--
John Hubbard
NVIDIA