Re: [PATCH v4 5/5] hugetlb: add hugetlb demote page support

From: Mike Kravetz
Date: Fri Oct 08 2021 - 16:58:14 EST


On 10/7/21 11:19 AM, Mike Kravetz wrote:
> +static int demote_free_huge_page(struct hstate *h, struct page *page)
> +{
> +	int i, nid = page_to_nid(page);
> +	struct hstate *target_hstate;
> +	int rc = 0;
> +
> +	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
> +
> +	remove_hugetlb_page_for_demote(h, page, false);
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	rc = alloc_huge_page_vmemmap(h, page);
> +	if (rc) {
> +		/* Allocation of vmemmap failed, we can not demote page */
> +		spin_lock_irq(&hugetlb_lock);
> +		set_page_refcounted(page);
> +		add_hugetlb_page(h, page, false);
> +		return rc;
> +	}
> +
> +	/*
> +	 * Use destroy_compound_hugetlb_page_for_demote for all huge page
> +	 * sizes as it will not ref count pages.
> +	 */
> +	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
> +
> +	for (i = 0; i < pages_per_huge_page(h);
> +				i += pages_per_huge_page(target_hstate)) {
> +		if (hstate_is_gigantic(target_hstate))
> +			prep_compound_gigantic_page_for_demote(page + i,
> +							target_hstate->order);
> +		else
> +			prep_compound_page(page + i, target_hstate->order);
> +		set_page_private(page + i, 0);
> +		set_page_refcounted(page + i);
> +		prep_new_huge_page(target_hstate, page + i, nid);
> +		put_page(page + i);
> +	}

I was doing some stress testing with multiple concurrent writers to
sysfs/.../nr_hugepages and sysfs/.../demote. On occasion, I would see
unexpected surplus pages of the smaller huge page size (2M on x86).

Here is what was happening. One task was decrementing the number of
2M huge pages with "echo 0 > nr_hugepages". It proceeded to the routine
set_max_huge_pages and was executing the following:

	/*
	 * Decrease the pool size
	 * First return free pages to the buddy allocator (being careful
	 * to keep enough around to satisfy reservations). Then place
	 * pages into surplus state as needed so the pool will shrink
	 * to the desired size as pages become free.
	 *
	 * By placing pages into the surplus state independent of the
	 * overcommit value, we are allowing the surplus pool size to
	 * exceed overcommit. There are few sane options here. Since
	 * alloc_surplus_huge_page() is checking the global counter,
	 * though, we'll note that we're not allowed to exceed surplus
	 * and won't grow the pool anywhere else. Not until one of the
	 * sysctls are changed, or the surplus pages go out of use.
	 */
	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
	min_count = max(count, min_count);
	try_to_free_low(h, min_count, nodes_allowed);

	/*
	 * Collect pages to be removed on list without dropping lock
	 */
	while (min_count < persistent_huge_pages(h)) {
		page = remove_pool_huge_page(h, nodes_allowed, 0);
		if (!page)
			break;

		list_add(&page->lru, &page_list);
	}
	/* free the pages after dropping lock */
	spin_unlock_irq(&hugetlb_lock);
	update_and_free_pages_bulk(h, &page_list);
	flush_free_hpage_work(h);

Now, while the lock was dropped the routine demote_free_huge_page above
added 512 huge pages to the 2M pool.

	spin_lock_irq(&hugetlb_lock);

Then after acquiring the lock we make these 512 pages surplus.

	while (count < persistent_huge_pages(h)) {
		if (!adjust_pool_surplus(h, nodes_allowed, 1))
			break;
	}
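
To make the accounting concrete, here is a small user-space model of the
sequence above. It is only a sketch: struct pool and the helper names
(shrink_phase1, demote_adds, shrink_phase2, simulate_race) are made up for
illustration and are not kernel code. It assumes all pages in the pool are
free with no reservations, and that persistent_huge_pages() is
nr_huge_pages minus surplus_huge_pages.

```c
/* Toy model of the hstate counters involved in the race; not kernel code. */
struct pool {
	long nr_huge_pages;      /* all pages currently in the pool */
	long free_huge_pages;    /* pages that are free */
	long resv_huge_pages;    /* pages held back for reservations */
	long surplus_huge_pages; /* pages marked surplus */
};

static long persistent(const struct pool *p)
{
	return p->nr_huge_pages - p->surplus_huge_pages;
}

/* Lock held: remove free pages until the pool is down to min_count. */
static void shrink_phase1(struct pool *p, long count)
{
	long min_count = p->resv_huge_pages + p->nr_huge_pages -
			 p->free_huge_pages;

	if (min_count < count)
		min_count = count;
	while (min_count < persistent(p) && p->free_huge_pages > 0) {
		p->nr_huge_pages--;
		p->free_huge_pages--;
	}
}

/* Lock dropped: a concurrent demote adds npages free pages to the pool. */
static void demote_adds(struct pool *p, long npages)
{
	p->nr_huge_pages += npages;
	p->free_huge_pages += npages;
}

/* Lock reacquired: everything above count is turned into surplus. */
static void shrink_phase2(struct pool *p, long count)
{
	while (count < persistent(p))
		p->surplus_huge_pages++;
}

/* Replay "echo 0 > nr_hugepages" racing with one demote operation. */
static long simulate_race(long initial_pages, long demoted_pages)
{
	struct pool p = { initial_pages, initial_pages, 0, 0 };

	shrink_phase1(&p, 0);
	demote_adds(&p, demoted_pages);
	shrink_phase2(&p, 0);
	return p.surplus_huge_pages;
}
```

With demoted_pages set to 512 the model ends with 512 surplus pages, and
with 0 (no demote in the window) it ends with none, which matches what I
saw in testing.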

To prevent this race from happening in general, the hstate specific mutex
resize_lock is held for the duration of set_max_huge_pages. Since the
demote code is also adjusting pool sizes, it should also take the mutex.
The routine demote_store already takes the mutex of the hstate of the
page size being demoted (1G in this case). That is because the 1G pool
size will be decreased. We also need to take the resize mutex of the 2M
pool as this pool will be increased. To prevent deadlocks, we use the
convention of always taking the resize mutex of the larger hstate first.
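
The ordering convention can be sketched in user space with pthread mutexes.
This is an illustration under stated assumptions, not the patch itself:
struct toy_hstate, toy_hstate_init and toy_demote_one are hypothetical
names, and both mutexes are taken in one function here, whereas in the
kernel the source hstate mutex is taken in demote_store and the target
hstate mutex in demote_free_huge_page.

```c
#include <pthread.h>

/* Hypothetical stand-in for struct hstate with only the fields needed. */
struct toy_hstate {
	unsigned int order;          /* log2 of base pages per huge page */
	pthread_mutex_t resize_lock; /* serializes pool size changes */
	unsigned long nr_huge_pages;
	unsigned long max_huge_pages;
};

static void toy_hstate_init(struct toy_hstate *h, unsigned int order,
			    unsigned long nr)
{
	h->order = order;
	pthread_mutex_init(&h->resize_lock, NULL);
	h->nr_huge_pages = nr;
	h->max_huge_pages = nr;
}

/*
 * Demote one page from src to dst. Returns 0 on success, -1 if dst is
 * not a smaller page size or src has no pages left.
 */
static int toy_demote_one(struct toy_hstate *src, struct toy_hstate *dst)
{
	unsigned long ratio;
	int rc = -1;

	if (src->order <= dst->order)
		return -1;

	/* Convention: the larger hstate's resize mutex is taken first. */
	pthread_mutex_lock(&src->resize_lock);
	pthread_mutex_lock(&dst->resize_lock);

	if (src->nr_huge_pages > 0) {
		ratio = 1UL << (src->order - dst->order);
		src->nr_huge_pages--;
		src->max_huge_pages--;
		dst->nr_huge_pages += ratio;
		dst->max_huge_pages += ratio;
		rc = 0;
	}

	pthread_mutex_unlock(&dst->resize_lock);
	pthread_mutex_unlock(&src->resize_lock);
	return rc;
}
```

Because demotion only ever goes from a larger size to a smaller one, every
path that needs both mutexes takes them in the same order, so two tasks
can not end up each holding the mutex the other is waiting for.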

An updated version of this patch below adds taking the 'target hstate'
mutex in demote_free_huge_page. Although unnecessary, it also updates
max_huge_pages of both hstates for consistency.