Re: [PATCH 8/9] memory-hotplug: enable memory hotplug to handlehugepage

From: Michal Hocko
Date: Wed Mar 20 2013 - 03:57:46 EST


On Tue 19-03-13 23:55:33, Naoya Horiguchi wrote:
> On Mon, Mar 18, 2013 at 05:07:37PM +0100, Michal Hocko wrote:
> > On Thu 21-02-13 14:41:47, Naoya Horiguchi wrote:
[...]
> > > As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> > > over them because it's larger than memory block. So we now simply leave
> > > it to fail as it is.
> >
> > What we could do is to check whether there is a free gb huge page on
> > other node and migrate there.
>
> Correct, and 1GB page migration needs more code in migration core code
> (mainly it's related to migration entry in pud) and enough testing,
> so I want to do it in separate patchset.

Sure, this was just a note that it is achievable not that it has to be
done in the patchset.

[...]
> > > diff --git v3.8.orig/mm/hugetlb.c v3.8/mm/hugetlb.c
> > > index ccf9995..c28e6c9 100644
> > > --- v3.8.orig/mm/hugetlb.c
> > > +++ v3.8/mm/hugetlb.c
> > > @@ -843,6 +843,30 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> > > return ret;
> > > }
> > >
> > > +/* Dissolve a given free hugepage into free pages. */
> > > +void dissolve_free_huge_page(struct page *page)
> > > +{
> > > + if (PageHuge(page) && !page_count(page)) {
> >
> > Could you clarify why you are cheking page_count here? I assume it is to
> > make sure the page is free but what prevents it being increased before
> > you take hugetlb_lock?
>
> There's nothing to prevent it, so it's not safe to check refcount outside
> hugetlb_lock.

OK, so the lock has to be moved up.

[...]
> > > diff --git v3.8.orig/mm/memory_hotplug.c v3.8/mm/memory_hotplug.c
> > > index d04ed87..6418de2 100644
> > > --- v3.8.orig/mm/memory_hotplug.c
> > > +++ v3.8/mm/memory_hotplug.c
> > > @@ -29,6 +29,7 @@
> > > #include <linux/suspend.h>
> > > #include <linux/mm_inline.h>
> > > #include <linux/firmware-map.h>
> > > +#include <linux/hugetlb.h>
> > >
> > > #include <asm/tlbflush.h>
> > >
> > > @@ -985,10 +986,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
> > > }
> > >
> > > /*
> > > - * Scanning pfn is much easier than scanning lru list.
> > > - * Scan pfn from start to end and Find LRU page.
> > > + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> > > + * and hugepages). We scan pfn because it's much easier than scanning over
> > > + * linked list. This function returns the pfn of the first found movable
> > > + * page if it's found, otherwise 0.
> > > */
> > > -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> > > +static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> > > {
> > > unsigned long pfn;
> > > struct page *page;
> > > @@ -997,6 +1000,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> > > page = pfn_to_page(pfn);
> > > if (PageLRU(page))
> > > return pfn;
> > > + if (PageHuge(page)) {
> > > + if (is_hugepage_movable(page))
> > > + return pfn;
> > > + else
> > > + pfn += (1 << compound_order(page)) - 1;
> > > + }
> >
> > scan_lru_pages's name gets really confusing after this change because
> > hugetlb pages are not on the LRU. Maybe it would be good to rename it.
>
> Yes, and that's done in right above chunk.

bahh, I am blind. I got confused by the name in the hunk header. Sorry
about that.

>
> >
> > > }
> > > }
> > > return 0;
> > > @@ -1019,6 +1028,30 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > > page = pfn_to_page(pfn);
> > > if (!get_page_unless_zero(page))
> > > continue;
> >
> > All tail pages have 0 reference count (according to prep_compound_page)
> > so they would be skipped anyway. This makes the below pfn tweaks
> > pointless.
>
> I was totally mistaken about what we should do here, sorry. If we call
> do_migrate_range() for 1GB hugepage, we should return with error (maybe -EBUSY)
> instead of just skipping it, otherwise the caller __offline_pages() repeats
> 'goto repeat' until timeout. In order to do that, we had better insert
> if(PageHuge) block before getting refcount. And ...
>
> > > + if (PageHuge(page)) {
> > > + /*
> > > + * Larger hugepage (1GB for x86_64) is larger than
> > > + * memory block, so pfn scan can start at the tail
> > > + * page of larger hugepage. In such case,
> > > + * we simply skip the hugepage and move the cursor
> > > + * to the last tail page.
> > > + */
> > > + if (PageTail(page)) {
> > > + struct page *head = compound_head(page);
> > > + pfn = page_to_pfn(head) +
> > > + (1 << compound_order(head)) - 1;
> > > + put_page(page);
> > > + continue;
> > > + }
> > > + pfn = (1 << compound_order(page)) - 1;
> > > + if (huge_page_size(page_hstate(page)) != PMD_SIZE) {
> > > + put_page(page);
> > > + continue;
> > > + }
> >
> > There might be other hugepage sizes which fit into memblock so this test
> > doesn't seem right.
>
> yes, so compound_order(head) > PFN_SECTION_SHIFT would be better.

I would rather see compound_order(head) < MAX_ORDER to be more coupled
with the allocator.

> I'll replace this chunk with the following if I don't get any other
> suggestion.
>
> @@ -1017,6 +1026,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> if (!pfn_valid(pfn))
> continue;
> page = pfn_to_page(pfn);
> +
> + if (PageHuge(page)) {
> + struct page *head = compound_head(page);
> + pfn = page_to_pfn(head) + (1 << compound_order(head)) - 1;

I do not think this is safe without an elevated ref count. Your page
might be on the way to be freed. So you need to put get_page_unless_zero
before compound_order check.

Besides that I do not see too much point in optimizing this path on the
code complexity behalf. Sure we would call get_page_unless_zero
pointlessly for all tail pages but this is hardly a hot path.

> + if (compound_order(head) > PFN_SECTION_SHIFT) {
> + ret = -EBUSY;
> + break;
> + }
> + if (!get_page_unless_zero(page))

Should be head.

> + continue;
> + list_move_tail(&head->lru, &source);
> + move_pages -= 1 << compound_order(head);
> + continue;
> + }
> +
> if (!get_page_unless_zero(page))
> continue;
> /*
>
> Thanks,
> Naoya
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/