Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast

From: Jason Gunthorpe
Date: Tue Aug 30 2022 - 19:44:54 EST


On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> + /* Clear/invalidate the PTE before checking for PINs. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;

It is usually a bad sign to see an attempt to create a "read release"..

> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }

I don't know enough about the memory model to say if this is OK..

Generally, I've never seen an algorithm be successfull with these
kinds of multi-atomic gyrations.

If we break it down a bit, and replace the 'read release' with an
actual atomic for discussion:


CPU0 CPU1
clear pte
incr_return ref // release & acquire
add_ref // acquire

This seems OK, if CPU1 views !dma then CPU0 must view clear pte due to
the atomic's release/acquire semantic

If CPU1 views dma then it just exits


Now the second phase:

CPU0 CPU1
clear anon_exclusive
restore pte // release

read_pte // acquire
read anon_exclusive

If CPU0 observes the restored PTE then it must observe the cleared
anon_exclusive

Otherwise CPU0 must observe the cleared PTE.

So, maybe I could convince myself it is OK, but I think your placement
of barriers is confusing as to what data the barrier is actually
linked to.

We are using a barrier around the ref - acquire on the CPU0 and full
barier on the CPU1 (eg atomic_read(); smb_mb_after_atomic() )

The second phase uses a smp_store_release/load_acquire on the PTE.

It is the same barriers you sketched but integrated with the data they
are ordering.

Jason