Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY

From: Mike Kravetz
Date: Mon May 24 2021 - 14:08:10 EST


On 5/21/21 12:44 AM, Mina Almasry wrote:
> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
>
> To fix this, we first if there exists a page in the cache which already
> consumed the reservation, and return -EEXIST immediately if so.

Thanks Mina.

Andrew brought up the possibility of two separate fixes:
1) A simple one to address the common cause of the underflow
2) A more extensive modification to address all the error path code

I think this is a good idea. You mentioned that the underflow is
transient and will correct itself. That is true, however code assumes
the reserve count is an unsigned value always >= the number of free
pages. If code like the following is run while in this transitive state,
we will run into more serious issues:

/*
* A child process with MAP_PRIVATE mappings created by their parent
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
if (!vma_has_reserves(vma, chg) &&
h->free_huge_pages - h->resv_huge_pages == 0)
goto err;

/* If reserves cannot be used, ensure enough pages are in the pool */
if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
goto err;

For this reason, I think the simple change sent to stable would be a
good idea.

Do note that the call to hugetlbfs_pagecache_present() which would be
used in the stable fix is not 100% correct. See below.

> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
>
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
>
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
>
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
>
> Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> Cc: Peter Xu <peterx@xxxxxxxxxx>
> Cc: linux-mm@xxxxxxxxx
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: linux-mm@xxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
>
> ---
> include/linux/hugetlb.h | 4 ++
> mm/hugetlb.c | 103 ++++++++++++++++++++++++++++++++++++----
> mm/migrate.c | 39 +++------------
> 3 files changed, 103 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b92f25ccef58..427974510965 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> bool is_hugetlb_entry_migration(pte_t pte);
> void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
>
> +void hugetlb_copy_page(struct page *dst, struct page *src);
> +
> #else /* !CONFIG_HUGETLB_PAGE */
>
> static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> @@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
>
> static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
>
> +static inline void hugetlb_copy_page(struct page *dst, struct page *src);
> +
> #endif /* !CONFIG_HUGETLB_PAGE */
> /*
> * hugepages at page global directory. If arch support

How about just making the existing routine copy_huge_page() callable from
outside migrate.c and avoid all this code movement? If we want to do
any code movement, I would suggest moving copy_huge_page and routines it
depends on to mm/huge_memory.c. That seems like a more appropriate
location. But, I am fine with leaving them in migrate.c

> index 629aa4c2259c..cb041c97a558 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
< snip code dealing with movement of copy routines >
> @@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> struct page **pagep)
> {
> bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> - struct address_space *mapping;
> - pgoff_t idx;
> + struct hstate *h = hstate_vma(dst_vma);
> + struct address_space *mapping = dst_vma->vm_file->f_mapping;
> + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> unsigned long size;
> int vm_shared = dst_vma->vm_flags & VM_SHARED;
> - struct hstate *h = hstate_vma(dst_vma);
> pte_t _dst_pte;
> spinlock_t *ptl;
> - int ret;
> + int ret = -ENOMEM;
> struct page *page;
> int writable;
> -
> - mapping = dst_vma->vm_file->f_mapping;
> - idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> + struct mempolicy *mpol;
> + nodemask_t *nodemask;
> + gfp_t gfp_mask = htlb_alloc_mask(h);
> + int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);
>
> if (is_continue) {
> ret = -EFAULT;
> @@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> if (!page)
> goto out;
> } else if (!*pagep) {
> - ret = -ENOMEM;
> + /* If a page already exists, then it's UFFDIO_COPY for
> + * a non-missing case. Return -EEXIST.
> + */
> + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

We only want to call hugetlbfs_pagecache_present() if vm_shared.
Something like:

if (vm_shared &&
hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

This could be a private mapping of a file in which case we do not care
about the contents of the cache/file. In fact, it could provide a false
positive.

> + ret = -EEXIST;
> + goto out;
> + }
> +
> page = alloc_huge_page(dst_vma, dst_addr, 0);
> if (IS_ERR(page))
> goto out;
> @@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> /* fallback to copy_from_user outside mmap_lock */
> if (unlikely(ret)) {
> ret = -ENOENT;
> + /* Free the allocated page which may have
> + * consumed a reservation.
> + */
> + restore_reserve_on_error(h, dst_vma, dst_addr, page);

Good!

> + if (!HPageRestoreReserve(page)) {
> + if (unlikely(hugetlb_unreserve_pages(
> + mapping->host, idx, idx + 1, 1)))
> + hugetlb_fix_reserve_counts(
> + mapping->host);
> + }

I do not understand the need to call hugetlb_unreserve_pages(). The
call to restore_reserve_on_error 'should' fix up the reserve map to
align with restoring the reserve count in put_page/free_huge_page.
Can you explain why that is there?

> + put_page(page);
> +
> + /* Allocate a temporary page to hold the copied
> + * contents.
> + */
> + page = alloc_migrate_huge_page(h, gfp_mask, node,
> + nodemask);

I would suggest calling alloc_huge_page_vma to allocate the page here.
alloc_huge_page_vma will allocate a huge page from the pool if one is
available, and fall back to the buddy allocator. This is the way the
migration code works.

alloc_huge_page_vma has the advantage that it 'can' work for gigantic
pages if some are available in the pool. Note that alloc_migrate_huge_page
immediately checks for and fails if a gigantic page is needed.

If we use alloc_migrate_huge_page, the userfaultfd selftest you are
using will fail when tested with 'hacked code' to ALWAYS take this
error path. That is because we will need two huge pages (one
temporary) and only one is available. I believe this is OK. As
mentioned, this is how the migration code works.

> + if (IS_ERR(page)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> *pagep = page;
> - /* don't free the page */
> + /* Set the outparam pagep and return to the caller to
> + * copy the contents outside the lock. Don't free the
> + * page.
> + */
> goto out;
> }
> } else {
> - page = *pagep;
> + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

Again, check for vm_shared before calling hugetlbfs_pagecache_present.

> + put_page(*pagep);
> + ret = -EEXIST;
> + goto out;

We need to set *pagep = NULL before returning. The calling routine
__mcopy_atomic_hugetlb will BUG() if we return with an error code other
than -ENOENT and *pagep not NULL.

> + }
> +
> + page = alloc_huge_page(dst_vma, dst_addr, 0);
> + if (IS_ERR(page)) {
> + ret = -ENOMEM;

Again, clear *pagep before returning.

> + goto out;
> + }
> + __copy_gigantic_page(page, *pagep, pages_per_huge_page(h));

See previous discussion about copy routine.

> + put_page(*pagep);
> *pagep = NULL;
> }

Later in hugetlb_mcopy_atomic_pte, we will update the page cache and
page table. There are error checks when performing these operations.
It is unlikely we will ever hit one of these errors. However, if we
do encounter an error, the code should call restore_reserve_on_error
before the put_page. Perhaps something like:

@@ -4996,6 +4996,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (vm_shared || is_continue)
unlock_page(page);
out_release_nounlock:
+ restore_reserve_on_error(h, dst_vma, dst_addr, page);
put_page(page);
goto out;
}


>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b37d00890ca..d3437f9a608d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
< snip code dealing with movement of copy routines >

With changes like those above in hugetlb_mcopy_atomic_pte, we should be
able to remove the following code in __mcopy_atomic_hugetlb. We can do
this because any page returned/passed to __mcopy_atomic_hugetlb will NOT
have consumed a reserve. It would be a nice cleanup as that big comment
explains how we currently guess what is the right thing to do in this
situation.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 63a73e164d55..e13a0492b7ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -346,54 +346,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
out_unlock:
mmap_read_unlock(dst_mm);
out:
- if (page) {
- /*
- * We encountered an error and are about to free a newly
- * allocated huge page.
- *
- * Reservation handling is very subtle, and is different for
- * private and shared mappings. See the routine
- * restore_reserve_on_error for details. Unfortunately, we
- * can not call restore_reserve_on_error now as it would
- * require holding mmap_lock.
- *
- * If a reservation for the page existed in the reservation
- * map of a private mapping, the map was modified to indicate
- * the reservation was consumed when the page was allocated.
- * We clear the HPageRestoreReserve flag now so that the global
- * reserve count will not be incremented in free_huge_page.
- * The reservation map will still indicate the reservation
- * was consumed and possibly prevent later page allocation.
- * This is better than leaking a global reservation. If no
- * reservation existed, it is still safe to clear
- * HPageRestoreReserve as no adjustments to reservation counts
- * were made during allocation.
- *
- * The reservation map for shared mappings indicates which
- * pages have reservations. When a huge page is allocated
- * for an address with a reservation, no change is made to
- * the reserve map. In this case HPageRestoreReserve will be
- * set to indicate that the global reservation count should be
- * incremented when the page is freed. This is the desired
- * behavior. However, when a huge page is allocated for an
- * address without a reservation a reservation entry is added
- * to the reservation map, and HPageRestoreReserve will not be
- * set. When the page is freed, the global reserve count will
- * NOT be incremented and it will appear as though we have
- * leaked reserved page. In this case, set HPageRestoreReserve
- * so that the global reserve count will be incremented to
- * match the reservation map entry which was created.
- *
- * Note that vm_alloc_shared is based on the flags of the vma
- * for which the page was originally allocated. dst_vma could
- * be different or NULL on error.
- */
- if (vm_alloc_shared)
- SetHPageRestoreReserve(page);
- else
- ClearHPageRestoreReserve(page);
+ if (page)
put_page(page);
- }
BUG_ON(copied < 0);
BUG_ON(err > 0);
BUG_ON(!copied && !err);

--
Mike Kravetz