Re: [git pull] vfs.git get_user_pages_fast() conversion

From: Dan Williams
Date: Sat Nov 18 2017 - 16:49:27 EST

On Fri, Nov 17, 2017 at 1:32 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Nov 17, 2017 at 12:50:47PM -0800, Linus Torvalds wrote:
>> Not because the conversion was wrong, but because the original code is
>> so broken.
>> In particular, that "1" that is unchanged in the arguments is correct
>> in the conversion, but it was completely wrong in the original, even
>> if it happened to work.
>> it _should_ have been a FOLL_WRITE. Yes, it happens to have that
>> value, but it was broken.
>> (I note that a bit of grepping shows we have the same issue in a stale
>> comment in mm/ksm.c).
>> It would have been nice to see things like this mentioned in the commit message.
>> Because I'm pretty sure you actually _realized_ that as you made the
>> conversion, but there's no sign of that in the logs, because the
>> commit message just says
>> atomisp: use get_user_pages_fast()
>> without mentioning how broken the old case was (even it if happened to work).
> Point... Frankly, my impression from the whole thing is that get_user_pages()
> and get_user_pages_unlocked() calling conventions are overcomplicated, especially
> since most of the callers either should be get_user_pages_fast() or happen
> to be inside implementations of get_user_pages_fast(). Grepping for the
> remaining callers right now yields just this:
> arch/cris/arch-v32/drivers/cryptocop.c:2722: err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
> arch/cris/arch-v32/drivers/cryptocop.c:2736: err = get_user_pages((unsigned long int)oper.cipher_outdata,
> arch/ia64/kernel/err_inject.c:145: ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
> arch/x86/mm/mpx.c:550: gup_ret = get_user_pages((unsigned long)addr, nr_pages,
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:646: r = get_user_pages(userptr, num_pages, flags, p, NULL);
> drivers/gpu/drm/radeon/radeon_ttm.c:571: r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
> drivers/infiniband/core/umem.c:194: ret = get_user_pages(cur_base,
> drivers/infiniband/hw/mthca/mthca_memfree.c:475: ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NUL
> L);
> drivers/infiniband/hw/qib/qib_user_pages.c:70: ret = get_user_pages(start_page + got * PAGE_SIZE,
> drivers/infiniband/hw/usnic/usnic_uiom.c:146: ret = get_user_pages(cur_base,
> drivers/media/pci/ivtv/ivtv-udma.c:127: err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
> drivers/media/pci/ivtv/ivtv-yuv.c:78: y_pages = get_user_pages_unlocked(y_dma.uaddr,
> drivers/media/pci/ivtv/ivtv-yuv.c:82: uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
> drivers/media/v4l2-core/videobuf-dma-sg.c:188: err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
> drivers/misc/mic/scif/scif_rma.c:1399: pinned_pages->nr_pages = get_user_pages(
> drivers/misc/sgi-gru/grufault.c:201: if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
> mm/mempolicy.c:829: err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL);
> virt/kvm/kvm_main.c:1326: return get_user_pages(start, 1, flags, page, NULL);
> virt/kvm/kvm_main.c:1333: rc = get_user_pages(addr, 1, flags, NULL, NULL);
> virt/kvm/kvm_main.c:1395: npages = get_user_pages_unlocked(addr, 1, page, flags);
> and even those are dubious - e.g. cris ones could bloody well become a pair of
> get_user_pages_fast(), without ->mmap_sem being held across both calls. Itanic
> one is almost certainly buggered - we are not holding ->mmap_sem there.
> Hell knows... I wonder if exposing FOLL_... thing outside of mm/gup.c actually
> makes sense. With the users so heavily skewed towards just two combinations of
> flags (0 and FOLL_WRITE)...
> And for get_user_pages() itself it's even more ridiculous - vmalist (the last
> argument) is non-NULL in only one caller. Which uses it only to check if all
> of the VMAs happen to be hugetlb ones, apparently.

One note here, we've discovered that filesystem-dax mappings are
broken with respect to DMA and fallocate(PUNCH_HOLE). As a band-aid
we're looking to change rdma, video/media, and any other subsystem
that tries to pin pages indefinitely to fail in the vma_is_dax() case
with a new get_user_pages_longterm() helper [1]. Then the plan is to
follow on with new infrastructure to register memory with a file
lease. We need an interface for the kernel to notify userspace that
the pages it pinned need to be unpinned because the file blocks are
being deallocated (where 'file blocks' and 'memory pages' are one in
the same in the dax case).