Re: [PATCH 3/6] nfs: Eliminate nfs_get_user_pages()

From: Andrew Morton
Date: Fri May 19 2006 - 14:14:54 EST


Chuck Lever <cel@xxxxxxxxxx> wrote:
>
> Neil Brown observed that the kmalloc() in nfs_get_user_pages() is more
> likely to fail if the I/O is large enough to require the allocation of more
> than a single page to keep track of all the pinned pages in the user's
> buffer.
>
> Instead of tracking one large page array per dreq/iocb, track pages per
> nfs_read/write_data, just like the cached I/O path does. An array for
> pages is already allocated for us by nfs_readdata_alloc() (and the write
> and commit equivalents).
>
> This is also required for adding support for vectored I/O to the NFS direct
> I/O path.
>
> The original reason to pin the user buffer and allocate all the NFS data
> structures before trying to schedule I/O was to ensure all needed resources
> are allocated on the client before starting to send requests. This reduces
> the chance that resource exhaustion on the client will cause a short read
> or write.
>
> On the other hand, for an application making very large application I/O
> requests, this means that it will be nearly impossible for the application
> to make forward progress on a resource-limited client.
>
> Thus, moving the buffer pinning functionality into the I/O scheduling
> loops should be good for scalability. The next patch will do the same for
> NFS data structure allocation.
>
> +static void nfs_release_user_pages(struct page **pages, int npages)
> {
> - int result = -ENOMEM;
> - unsigned long page_count;
> - size_t array_size;
> -
> - page_count = (user_addr + size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - page_count -= user_addr >> PAGE_SHIFT;
> -
> - array_size = (page_count * sizeof(struct page *));
> - *pages = kmalloc(array_size, GFP_KERNEL);
> - if (*pages) {
> - down_read(&current->mm->mmap_sem);
> - result = get_user_pages(current, current->mm, user_addr,
> - page_count, (rw == READ), 0,
> - *pages, NULL);
> - up_read(&current->mm->mmap_sem);
> - if (result != page_count) {
> - /*
> - * If we got fewer pages than expected from
> - * get_user_pages(), the user buffer runs off the
> - * end of a mapping; return EFAULT.
> - */
> - if (result >= 0) {
> - nfs_free_user_pages(*pages, result, 0);
> - result = -EFAULT;
> - } else
> - kfree(*pages);
> - *pages = NULL;
> - }
> - }
> - return result;
> + int i;
> + for (i = 0; i < npages; i++)
> + page_cache_release(pages[i]);
> }

If `npages' is negative, this does the right thing.

> + result = get_user_pages(current, current->mm, user_addr,
> + data->npages, 1, 0, data->pagevec, NULL);
> + up_read(&current->mm->mmap_sem);
> + if (unlikely(result < data->npages))
> + goto out_err;
> ...
> +out_err:
> + nfs_release_user_pages(data->pagevec, result);

And `npages' can indeed be negative.

So. No bug there, but the code is a little unobvious and fragile - if
someone were to alter a type then subtle bugs would happen.

Perhaps

if (result > 0)
nfs_release_user_pages(...);

would be cleaner. Or at least a loud comment in nfs_release_user_pages().
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/