Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

From: John Hubbard
Date: Mon Nov 04 2019 - 14:04:45 EST


On 11/4/19 9:33 AM, Jerome Glisse wrote:
...
>
> Few nitpick belows, nonetheless:
>
> Reviewed-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> [...]
>> +
>> +CASE 3: ODP
>> +-----------
>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>> +replayable page faulting). There are GUP references to pages serving as DMA
>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>> +needs to be set.
>
> I would not include ODP or anything like it here, they do not use
> GUP anymore and i believe it is more confusing here. I would how-
> ever include some text in this documentation explaining that hard-
> ware that support page fault is superior as it does not incur any
> of the issues described here.

OK, agreed, here's a new write up that I'll put in v3:


CASE 3: ODP
-----------
Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

It's worth mentioning here that pinning pages should not be the first design
choice. If page fault capable hardware is available, then the software should
be written so that it does not pin pages. This allows mm and filesystems to
operate more efficiently and reliably.

> [...]
>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 199da99e8ffc..1aea48427879 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>
> [...]
>
>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>> BUG_ON(*locked != 1);
>> }
>>
>> - if (pages)
>> + /*
>> + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>> + * is to set FOLL_GET if the caller wants pages[] filled in (but has
>> + * carelessly failed to specify FOLL_GET), so keep doing that, but only
>> + * for FOLL_GET, not for the newer FOLL_PIN.
>> + *
>> + * FOLL_PIN always expects pages to be non-null, but no need to assert
>> + * that here, as any failures will be obvious enough.
>> + */
>> + if (pages && !(flags & FOLL_PIN))
>> flags |= FOLL_GET;
>
> Did you look at user that have pages and not FOLL_GET set ?
> I believe it would be better to first fix them to end up
> with FOLL_GET set and then error out if pages is != NULL but
> nor FOLL_GET or FOLL_PIN is set.
>

I was perhaps overly cautious, and didn't go there. However, it's probably
doable, given that there was already the following in __get_user_pages():

VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

...which will have conditioned people and code to set FOLL_GET together with
pages. So I agree that the time is right.

In order to make bisecting future failures simpler, I can insert a patch right
before this one, that changes the FOLL_GET setting into an assert, like this:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..be338961e80d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
BUG_ON(*locked != 1);
}

- if (pages)
- flags |= FOLL_GET;
+ if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
+ return -EINVAL;

pages_done = 0;
lock_dropped = false;


...and then add in FOLL_PIN, with this patch.

>>
>> pages_done = 0;
>
>> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>> return ret;
>> }
>>
>> -/**
>> - * get_user_pages_fast() - pin user pages in memory
>> - * @start: starting user address
>> - * @nr_pages: number of pages from start to pin
>> - * @gup_flags: flags modifying pin behaviour
>> - * @pages: array that receives pointers to the pages pinned.
>> - * Should be at least nr_pages long.
>> - *
>> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
>> - * If not successful, it will fall back to taking the lock and
>> - * calling get_user_pages().
>> - *
>> - * Returns number of pages pinned. This may be fewer than the number
>> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
>> - * were pinned, returns -errno.
>> - */
>> -int get_user_pages_fast(unsigned long start, int nr_pages,
>> - unsigned int gup_flags, struct page **pages)
>> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>> + unsigned int gup_flags,
>> + struct page **pages)
>
> Usualy function are rename to _old_func_name ie add _ in front. So
> here it would become _get_user_pages_fast but i know some people
> don't like that as sometimes we endup with ___function_overloaded :)

Exactly: the __get_user_pages* names were already used for *non*-internal
routines, so I attempted to pick the next best naming prefix.

>
>> {
>> unsigned long addr, len, end;
>> int nr = 0, ret = 0;
>
>
>> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>
> [...]
>
>> +/**
>> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
>
> Not a fan of (typically) maybe:
> pin_user_pages_remote() - pin pages of a remote process (task != current)
>
> I think here the remote part if more important that DIO. Remote is use by
> other thing that DIO.

Yes, good point. I'll use your wording:

* pin_user_pages_remote() - pin pages of a remote process (task != current)



>
>> + *
>> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
>> + * get_user_pages_remote() for documentation on the function arguments, because
>> + * the arguments here are identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for details.
>> + *
>> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
>> + * is NOT intended for Case 2 (RDMA: long-term pins).
>> + */
>> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> + unsigned long start, unsigned long nr_pages,
>> + unsigned int gup_flags, struct page **pages,
>> + struct vm_area_struct **vmas, int *locked)
>> +{
>> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> + if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> + return -EINVAL;
>> +
>> + gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
>> +
>> + return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>> + locked, gup_flags);
>> +}
>> +EXPORT_SYMBOL(pin_user_pages_remote);
>> +
>> +/**
>> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
>
> I think you copy pasted this from pin_user_pages_remote() :)

I admit to nothing, with respect to copy-paste! :)

This one can simply be:

* pin_longterm_pages_remote() - pin pages of a remote process (task != current)


>
>> + *
>> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
>> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
>> + * documentation on the function arguments, because the arguments here are
>> + * identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for further details.
>> + *
>> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
>> + * typically by a non-CPU device, and we cannot be sure that waiting for a
>> + * pinned page to become unpin will be effective.
>> + *
>> + * This is intended for Case 2 (RDMA: long-term pins) in
>> + * Documentation/vm/pin_user_pages.rst.
>> + */
>> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> + unsigned long start, unsigned long nr_pages,
>> + unsigned int gup_flags, struct page **pages,
>> + struct vm_area_struct **vmas, int *locked)
>> +{
>> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> + if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> + return -EINVAL;
>> +
>> + /*
>> + * FIXME: as noted in the get_user_pages_remote() implementation, it
>> + * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
>> + * needs to be set, but for now the best we can do is a "TODO" item.
>> + */
>> + gup_flags |= FOLL_REMOTE | FOLL_PIN;
>
> Wouldn't it be better to not add pin_longterm_pages_remote() until
> it can be properly implemented ?
>

Well, the problem is that I need each call site that requires FOLL_PIN
to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
there is a hard, bright rule, which is: if and only if a caller sets
FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
be called.

So this leaves me with only two reasonable choices:

a) Convert the call site as above: pin_longterm_pages_remote(), which sets
FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
site *might* not need any changes to adopt the working gup.c code.

b) Convert the call site to pin_user_pages_remote(), which also sets
FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
There would also be a comment at the call site, to the effect of, "this
is the wrong call to make: it really requires FOLL_LONGTERM behavior".

When the FOLL_LONGTERM situation is fixed, the call site will need to be
changed to pin_longterm_pages_remote().

So you can probably see why I picked (a).


thanks,

John Hubbard
NVIDIA