Re: [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient

From: Vlastimil Babka
Date: Mon Nov 04 2013 - 11:16:43 EST


On 10/25/2013 05:46 PM, Robert Jennings wrote:
> From: Robert C Jennings <rcj@xxxxxxxxxxxxxxxxxx>
>
> Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
> pages.
>
> When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE), the
> writer's gifted pages will be zapped. This patch supports further work
> to move vmsplice'd pages rather than copying them. That follow-on patch
> requires that a page not be mapped by the source at the time of the
> move; otherwise it falls back to copying the page.
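
(For reference, the writer-side usage this enables would look roughly
like the following; an untested userspace sketch, with the pipefd/buf
names being mine:)

	struct iovec iov = {
		.iov_base = buf,	/* must be page-aligned */
		.iov_len  = PAGE_SIZE,
	};

	/* Gift the pages and ask the kernel to unmap them from the writer */
	ssize_t ret = vmsplice(pipefd[1], &iov, 1,
			       SPLICE_F_GIFT | SPLICE_F_MOVE);
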
>
> Signed-off-by: Matt Helsley <matt.helsley@xxxxxxxxx>
> Signed-off-by: Robert C Jennings <rcj@xxxxxxxxxxxxxxxxxx>
> ---
> Changes since v1:
> - Cleanup zap coalescing in splice_to_pipe for readability
> - The field added to struct partial_page in v1 was unnecessary; the
> existing private field is used instead.
> ---
> fs/splice.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 3b7ee65..c14be6f 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -188,12 +188,18 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  {
>  	unsigned int spd_pages = spd->nr_pages;
>  	int ret, do_wakeup, page_nr;
> +	struct vm_area_struct *vma;
> +	unsigned long user_start, user_end, addr;
> 
>  	ret = 0;
>  	do_wakeup = 0;
>  	page_nr = 0;
> +	vma = NULL;
> +	user_start = user_end = 0;
> 
>  	pipe_lock(pipe);
> +	/* mmap_sem taken for zap_page_range with SPLICE_F_MOVE */
> +	down_read(&current->mm->mmap_sem);

I had suggested taking the semaphore here only when both the GIFT and
MOVE flags are set. You replied that taking it once, outside the loop,
already improved performance. That's fine, but my point was that
vmsplice calls without these flags should not take the semaphore at
all, to avoid needless mmap_sem contention.
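
Something along these lines is what I had in mind; an untested sketch
of splice_to_pipe(), assuming (per the changelog) that both flags must
be set:

	bool zap = (spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE)) ==
		   (SPLICE_F_GIFT | SPLICE_F_MOVE);

	pipe_lock(pipe);
	/* mmap_sem is only needed for the zap_page_range() calls */
	if (zap)
		down_read(&current->mm->mmap_sem);
	...
	if (zap) {
		if (vma)
			zap_page_range(vma, user_start,
				       user_end - user_start, NULL);
		up_read(&current->mm->mmap_sem);
	}
	pipe_unlock(pipe);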

>
>  	for (;;) {
>  		if (!pipe->readers) {
> @@ -215,6 +221,33 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  			if (spd->flags & SPLICE_F_GIFT)
>  				buf->flags |= PIPE_BUF_FLAG_GIFT;
> 
> +			/* Prepare to move page sized/aligned bufs.
> +			 * Gather pages for a single zap_page_range()
> +			 * call per VMA.
> +			 */
> +			if (spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE) &&
> +			    !buf->offset &&
> +			    (buf->len == PAGE_SIZE)) {
> +				addr = buf->private;

Here you assume that buf->private (initialized from
spd->partial[page_nr].private) will contain a valid user address
whenever the GIFT and MOVE flags are set. I think that's quite
dangerous and could be easily exploited. From a brief look, at least
one caller of splice_to_pipe(), __generic_file_splice_read(), doesn't
initialize the on-stack-allocated private fields, and it can take its
flags directly from the splice syscall.
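
One way to avoid trusting the field would be to act only on bufs that
the vmsplice path itself built, e.g. by checking the ops (an untested
sketch):

	/* Only vmsplice_to_pipe() fills in partial[].private; other
	 * callers such as __generic_file_splice_read() leave it
	 * uninitialized, so don't read it for them.
	 */
	if (spd->ops == &user_page_pipe_buf_ops &&
	    spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE) &&
	    !buf->offset && (buf->len == PAGE_SIZE)) {
		addr = buf->private;
		...
	}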

> +
> +				if (vma && (addr == user_end) &&
> +				    (addr + PAGE_SIZE <= vma->vm_end)) {
> +					/* Same vma, no holes */
> +					user_end += PAGE_SIZE;
> +				} else {
> +					if (vma)
> +						zap_page_range(vma, user_start,
> +							(user_end - user_start),
> +							NULL);
> +					vma = find_vma(current->mm, addr);

When addr crosses the previous vma's vm_end, there is a good chance
that simply taking vma->vm_next would suffice instead of a full
find_vma() lookup.
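
I.e. something like this in the else branch, after the zap (untested):

	/* When addr is just past vma->vm_end, the neighbour is the
	 * likely match; fall back to find_vma() otherwise.
	 */
	if (vma && vma->vm_next && addr >= vma->vm_next->vm_start &&
	    addr < vma->vm_next->vm_end)
		vma = vma->vm_next;
	else
		vma = find_vma(current->mm, addr);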

> +					if (!IS_ERR_OR_NULL(vma)) {
> +						user_start = addr;
> +						user_end = (addr + PAGE_SIZE);
> +					} else
> +						vma = NULL;
> +				}
> +			}
> +
>  			pipe->nrbufs++;
>  			page_nr++;
>  			ret += buf->len;
> @@ -255,6 +288,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  		pipe->waiting_writers--;
>  	}
> 
> +	if (vma)
> +		zap_page_range(vma, user_start, (user_end - user_start), NULL);
> +
> +	up_read(&current->mm->mmap_sem);
>  	pipe_unlock(pipe);
> 
>  	if (do_wakeup)
> @@ -1475,6 +1512,7 @@ static int get_iovec_page_array(const struct iovec __user *iov,
> 
>  		partial[buffers].offset = off;
>  		partial[buffers].len = plen;
> +		partial[buffers].private = (unsigned long)base;
> 
>  		off = 0;
>  		len -= plen;
> 
