Re: [External] Re: [PATCH v20 6/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

From: Mike Kravetz
Date: Tue Apr 20 2021 - 13:49:48 EST


On 4/20/21 1:46 AM, Muchun Song wrote:
> On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> On 4/15/21 1:40 AM, Muchun Song wrote:
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 0abed7e766b8..6e970a7d3480 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>> * code knows it has only reference. All other examinations and
>>> * modifications require hugetlb_lock.
>>> * HPG_freed - Set when page is on the free lists.
>>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>>> * Synchronization: hugetlb_lock held for examination and modification.
>>
>> I like the per-page flag. In previous versions of the series, you just
>> checked the free_vmemmap_pages_per_hpage() to determine if vmemmmap
>> should be allocated. Is there any change in functionality that makes is
>> necessary to set the flag in each page, or is it mostly for flexibility
>> going forward?
>
> Actually, only the routine of dissolving the page cares whether
> the page is on the buddy free list when update_and_free_page
> returns. But we cannot change the return type of the
> update_and_free_page (e.g. change return type from 'void' to 'int').
> Why? If the hugepage is freed through a kworker, we cannot
> know the return value when update_and_free_page returns.
> So adding a return value seems odd.
>
> In the dissolving routine, We can allocate vmemmap pages first,
> if it is successful, then we can make sure that
> update_and_free_page can successfully free page. So I need
> some stuff to mark the page which does not need to allocate
> vmemmap pages.
>
> On the surface, we seem to have a straightforward method
> to do this.
>
> Add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> indicate that the caller is already allocated the vmemmap pages.
> update_and_free_page() do not need to allocate. Just like below.
>
> void update_and_free_page(struct hstate *h, struct page *page, bool atomic,
> bool alloc_vmemmap)
> {
> if (alloc_vmemmap)
> // allocate vmemmap pages
> }
>
> But if the page is freed through a kworker. How to pass
> 'alloc_vmemmap' to the kworker? We can embed this
> information into the per-page flag. So if we introduce
> HPG_vmemmap_optimized, the parameter of
> alloc_vmemmap is also necessary.
>
> So it seems that introducing HPG_vmemmap_optimized is
> a good choice.

Thanks for the explanation!

Agree that the flag is a good choice. How about adding a comment like
this above the alloc_huge_page_vmemmap call in dissolve_free_huge_page?

/*
* Normally update_and_free_page will allocate required vmemmmap before
* freeing the page. update_and_free_page will fail to free the page
* if it can not allocate required vmemmap. We need to adjust
* max_huge_pages if the page is not freed. Attempt to allocate
* vmemmmap here so that we can take appropriate action on failure.
*/

...
>>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
>>> + bool adjust_surplus)
>>> +{
>>
>> We need to be a bit careful with hugepage specific flags that may be
>> set. The routine remove_hugetlb_page which is called for 'page' before
>> this routine will not clear any of the hugepage specific flags. If the
>> calling path goes through free_huge_page, most but not all flags are
>> cleared.
>>
>> We had a discussion about clearing the page->private field in Oscar's
>> series. In the case of 'new' pages we can assume page->private is
>> cleared, but perhaps we should not make that assumption here. Since we
>> hope to rarely call this routine, it might be safer to do something
>> like:
>>
>> set_page_private(page, 0);
>> SetHPageVmemmapOptimized(page);
>
> Agree. Thanks for your reminder. I will fix this.
>
>>
>>> + int nid = page_to_nid(page);
>>> +
>>> + lockdep_assert_held(&hugetlb_lock);
>>> +
>>> + INIT_LIST_HEAD(&page->lru);
>>> + h->nr_huge_pages++;
>>> + h->nr_huge_pages_node[nid]++;
>>> +
>>> + if (adjust_surplus) {
>>> + h->surplus_huge_pages++;
>>> + h->surplus_huge_pages_node[nid]++;
>>> + }
>>> +
>>> + set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>>> +
>>> + /*
>>> + * The refcount can possibly be increased by memory-failure or
>>> + * soft_offline handlers.
>>> + */
>>> + if (likely(put_page_testzero(page))) {
>>
>> In the existing code there is no such test. Is the need for the test
>> because of something introduced in the new code?
>
> No.
>
>> Or, should this test be in the existing code?
>
> Yes. gather_surplus_pages should be fixed. I can fix it
> in a separate patch.
>
> The possible bad scenario:
>
> CPU0: CPU1:
> set_compound_page_dtor(HUGETLB_PAGE_DTOR);
> memory_failure_hugetlb
> get_hwpoison_page
> __get_hwpoison_page
> get_page_unless_zero
> put_page_testzero()
>
> put_page(page)
>
>
> More details and discussion can refer to:
>
> https://lore.kernel.org/linux-doc/CAMZfGtVRSBkKe=tKAKLY8dp_hywotq3xL+EJZNjXuSKt3HK3bQ@xxxxxxxxxxxxxx/
>

Thanks you! I did not remember that discussion.

It would be helpful to add a separate patch for gather_surplus_pages.
Otherwise, we have the VM_BUG_ON there and not in add_hugetlb_page.

--
Mike Kravetz