Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

From: Michal Hocko
Date: Wed Nov 29 2017 - 02:17:44 EST


On Tue 28-11-17 17:39:50, Mike Kravetz wrote:
> On 11/28/2017 06:12 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@xxxxxxxx>
> >
> > hugepage migration relies on __alloc_buddy_huge_page to get a new page.
> > This has 2 main disadvantages.
> > 1) it doesn't allow to migrate any huge page if the pool is used
> > completely which is not an exceptional case as the pool is static and
> > unused memory is just wasted.
> > 2) it leads to a weird semantic when migration between two numa nodes
> > might increase the pool size of the destination NUMA node while the page
> > is in use. The issue is caused by per NUMA node surplus pages tracking
> > (see free_huge_page).
> >
> > Address both issues by changing the way how we allocate and account
> > pages allocated for migration. Those should temporal by definition.
> > So we mark them that way (we will abuse page flags in the 3rd page)
> > and update free_huge_page to free such pages to the page allocator.
> > Page migration path then just transfers the temporal status from the
> > new page to the old one which will be freed on the last reference.
>
> In general, I think this will work. Some questions below.
>
> > The global surplus count will never change during this path but we still
> > have to be careful when freeing a page from a node with surplus pages
> > on the node.
>
> Not sure about the "freeing page from a node with surplus pages" comment.
> If allocating PageHugeTemporary pages does not adjust surplus counts, then
> there should be no concern at the time of freeing.
>
> Could this comment be a hold over from a previous implementation attempt?
>

Not really. You have to realize that the original page could be surplus
on its node. More on that below.

[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8189c92fac82..037bf0f89463 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
> > if (restore_reserve)
> > h->resv_huge_pages++;
> >
> > - if (h->surplus_huge_pages_node[nid]) {
> > + if (PageHugeTemporary(page)) {
> > + list_del(&page->lru);
> > + ClearPageHugeTemporary(page);
> > + update_and_free_page(h, page);
> > + if (h->surplus_huge_pages_node[nid])
> > + h->surplus_huge_pages_node[nid]--;
>
> I think this is not correct. Should the lines dealing with per-node
> surplus counts even be here? If the lines above are correct, then it
> implies that the sum of per node surplus counts could exceed (or get out
> of sync with) the global surplus count.

You are right, I guess. This per-node accounting makes the whole thing
real pain. I am worried that we will free next page from the same node
and reduce the overal pool size. I will think about it some more.

> > + } else if (h->surplus_huge_pages_node[nid]) {
> > /* remove the page from active list */
> > list_del(&page->lru);
> > update_and_free_page(h, page);
> > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> > return rc;
> > }
> >
> > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
> > +/*
> > + * Allocates a fresh surplus page from the page allocator. Temporary
> > + * requests (e.g. page migration) can pass enforce_overcommit == false
>
> 'enforce_overcommit == false' perhaps part of an earlier implementation
> attempt?

yeah.

[...]

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4d0be47a322a..b3345f8174a9 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> > hugetlb_cgroup_migrate(hpage, new_hpage);
> > put_new_page = NULL;
> > set_page_owner_migrate_reason(new_hpage, reason);
> > +
> > + /*
> > + * transfer temporary state of the new huge page. This is
> > + * reverse to other transitions because the newpage is going to
> > + * be final while the old one will be freed so it takes over
> > + * the temporary status.
> > + * No need for any locking here because destructor cannot race
> > + * with us.
> > + */
> > + if (PageHugeTemporary(new_hpage)) {
> > + SetPageHugeTemporary(hpage);
> > + ClearPageHugeTemporary(new_hpage);
> > + }
> > }
> >
> > unlock_page(hpage);
> >
>
> I'm still trying to wrap my head around all the different scenarios.
> In general, this new code only 'kicks in' if the there is not a free
> pre-allocated huge page for migration. Right?

yes

> So, if there are free huge pages they are 'consumed' during migration
> and the number of available pre-allocated huge pages is reduced? Or,
> is that not exactly how it works? Or does it depend in the purpose
> of the migration?

Well, if we have pre-allocated pages then we just consume them and they
will not get Temporary status so the additional code doesn't kick in.

> The only reason I ask is because this new method of allocating a surplus
> page (if successful) results in no decrease of available huge pages.
> Perhaps all migrations should attempt to allocate surplus pages and not
> impact the pre-allocated number of available huge pages.

That could reduce the chances of the migration success because
allocating a fresh huge page can fail.

--
Michal Hocko
SUSE Labs