Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages

From: HORIGUCHI NAOYA(堀口 直也)
Date: Fri Jul 17 2020 - 02:56:01 EST


On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks taking place before handing the page over to its end
> user would act as a safety net and skip that page.
>
> This has proved to be wrong, as there are pfn walkers out there, like
> compaction, that only care about the page being PageBuddy and sitting on a
> freelist.
> Although compaction might not be the only such user, having poisoned pages
> in the buddy allocator seems a bad idea, as we should only have
> free pages that are ready and meant to be used as such.
>
> Before explaining the approach taken, let us break down the kinds
> of pages we can soft offline.
>
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalidated)
>
> * Normal pages (order-0 and anon-THP)
>
> - If they are clean and unmapped page cache pages, we invalidate
> them by means of invalidate_inode_page().
> - If they are mapped/dirty, we do the isolate-and-migrate dance.
>
> Either way, we do not call put_page directly from those paths.
> Instead, we keep the page and send it to page_handle_poison to perform the
> proper handling.
>
> Among other things, page_handle_poison() sets the HWPoison flag and does the
> last put_page.
> That put_page is mainly there to be able to call __page_cache_release,
> since that function is not exported.
>
> Down the chain, we placed a check for HWPoison pages in
> free_pages_prepare that simply skips any poisoned page, so those pages
> end up neither in a pcplist nor in a buddy freelist.
>
> After that, we set the refcount on the page to 1 and we increment
> the poisoned pages counter.
>
> We could do as we do for free pages:
> 1) wait until the page hits buddy's freelists
> 2) take it off
> 3) flag it
>
> The problem is that we could race with an allocation, so by the time we
> want to take the page off the buddy freelist, the page may already have
> been allocated again, and we can no longer soft-offline it.
> This is not fatal, of course, but it is better if we can close the race,
> given that doing so does not require a lot of code.
>
> * Hugetlb pages
>
> - We isolate-and-migrate them
>
> There is no magic here, we just isolate and migrate them.
> A new set of internal functions has been added to flag a hugetlb page as
> poisoned: SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned().
> This allows us to flag the page when we migrate it, back in
> move_hugetlb_state().
>
> Later on, we check whether the page is poisoned in __free_huge_page,
> and in that case we bail out before sending the page to, e.g., the active
> free list.
> This gives us full control of the page, and we can handle it in
> page_handle_poison().
>
> In other words, we do not allow migrated hugepages to get back to the
> freelists.
>
> Since the page now has no user and has been migrated, we can call
> dissolve_free_huge_page, which will end up calling update_and_free_page.
> In update_and_free_page(), we check whether the page is poisoned.
> If so, we handle it as we handle gigantic pages, i.e. we break the
> page down into order-0 pages and free them one by one.
> Doing so allows free_pages_prepare to skip the poisoned ones.
>
> Because of the way we now handle in-use pages, we no longer need the
> put-as-isolation-migratetype dance that was there to keep poisoned pages
> from ending up in pcplists.

I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
small (4GB memory) VM, and, strangely, found that (1) the target hugepages
are not always dissolved and (2) dissolved hugepages are still counted
in "HugePages_Total:". See below:

$ ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,1.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 1
madvise: Cannot allocate memory

$ cat /proc/meminfo
MemTotal: 4026772 kB
MemFree: 976300 kB
MemAvailable: 892840 kB
Buffers: 20936 kB
Cached: 99768 kB
SwapCached: 5904 kB
Active: 84332 kB
Inactive: 116328 kB
Active(anon): 27944 kB
Inactive(anon): 68524 kB
Active(file): 56388 kB
Inactive(file): 47804 kB
Unevictable: 7532 kB
Mlocked: 0 kB
SwapTotal: 2621436 kB
SwapFree: 2609844 kB
Dirty: 56 kB
Writeback: 0 kB
AnonPages: 81764 kB
Mapped: 54348 kB
Shmem: 8948 kB
KReclaimable: 22744 kB
Slab: 52056 kB
SReclaimable: 22744 kB
SUnreclaim: 29312 kB
KernelStack: 3888 kB
PageTables: 2804 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 3260612 kB
Committed_AS: 828196 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 19260 kB
VmallocChunk: 0 kB
Percpu: 5120 kB
HardwareCorrupted: 5368 kB
AnonHugePages: 18432 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 1342 // still counted as hugetlb pages.
HugePages_Free: 0 // all hugepages are still allocated (or leaked?)
HugePages_Rsvd: 0
HugePages_Surp: 762 // some are counted in surplus.
Hugepagesize: 2048 kB
Hugetlb: 2748416 kB
DirectMap4k: 112480 kB
DirectMap2M: 4081664 kB


$ page-types -b hwpoison
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000080008 421 1 ___U_______________X_______________________ uptodate,hwpoison
0x00000000000a8018 1 0 ___UD__________H_G_X_______________________ uptodate,dirty,compound_head,huge,hwpoison
0x00000000000a801c 920 3 __RUD__________H_G_X_______________________ referenced,uptodate,dirty,compound_head,huge,hwpoison
total 1342 5

This means that some hugepages were dissolved, but others were not,
which is probably not desirable.
I'll dig into this more later, but let me share it first.

A few minor comments below ...

>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxxx>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> ---
> include/linux/page-flags.h | 5 ----
> mm/hugetlb.c | 60 +++++++++++++++++++++++++++++++++-----
> mm/memory-failure.c | 53 +++++++++++++--------------------
> mm/migrate.c | 11 ++-----
> mm/page_alloc.c | 38 +++++++-----------------
> 5 files changed, 86 insertions(+), 81 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> #define __PG_HWPOISON (1UL << PG_hwpoison)
> extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
> #else
> PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> - return 0;
> -}
> #define __PG_HWPOISON 0
> #endif
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
> #include <linux/numa.h>
> #include <linux/llist.h>
> #include <linux/cma.h>
> +#include <linux/migrate.h>
>
> #include <asm/page.h>
> #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> nr_nodes--)
>
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> - unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> + if (!PageHuge(page))
> + return false;
> +
> + return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> + page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> + page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> + unsigned int order)
> {
> int i;
> int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
> atomic_set(compound_pincount_ptr(page), 0);
>
> for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> + if (!hstate_is_gigantic(h))
> + p->mapping = NULL;
> clear_compound_head(p);
> set_page_refcounted(p);
> }
>
> + if (PageHugePoisoned(page))
> + ClearPageHugePoisoned(page);
> set_compound_order(page, 0);
> __ClearPageHead(page);
> }
>
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> static void free_gigantic_page(struct page *page, unsigned int order)
> {
> /*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> return NULL;
> }
> static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> - unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> + struct page *page,
> + unsigned int order) { }
> #endif
>
> static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
> + bool poisoned = PageHugePoisoned(page);
> + unsigned int order = huge_page_order(h);
>
> if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> * we might block in free_gigantic_page().
> */
> spin_unlock(&hugetlb_lock);
> - destroy_compound_gigantic_page(page, huge_page_order(h));
> - free_gigantic_page(page, huge_page_order(h));
> + destroy_compound_gigantic_page(h, page, order);
> + free_gigantic_page(page, order);
> spin_lock(&hugetlb_lock);
> } else {
> - __free_pages(page, huge_page_order(h));
> + if (unlikely(poisoned)) {
> + /*
> + * If the hugepage is poisoned, do as we do for
> + * gigantic pages and free the pages as order-0.
> + * free_pages_prepare will skip over the poisoned ones.
> + */
> + destroy_compound_gigantic_page(h, page, order);

Judging from its name, this function is for gigantic pages, so it shouldn't
be called for non-gigantic huge pages. Maybe renaming it and/or introducing
some inner function layer to factor out the common part would be better;
a rough sketch below.
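
To be concrete, I mean something like this. The body is just the code from
this patch; only the factoring and the helper name are new (the name
destroy_compound_hugetlb_page is only a placeholder, pick whatever you like),
and it is completely untested:

static void destroy_compound_hugetlb_page(struct hstate *h, struct page *page,
					  unsigned int order)
{
	int i;
	int nr_pages = 1 << order;
	struct page *p = page + 1;

	atomic_set(compound_mapcount_ptr(page), 0);
	atomic_set(compound_pincount_ptr(page), 0);

	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
		/* Only non-gigantic subpages carry the poison marker in ->mapping */
		if (!hstate_is_gigantic(h))
			p->mapping = NULL;
		clear_compound_head(p);
		set_page_refcounted(p);
	}

	if (PageHugePoisoned(page))
		ClearPageHugePoisoned(page);
	set_compound_order(page, 0);
	__ClearPageHead(page);
}

Then destroy_compound_gigantic_page() could become a thin wrapper for the
gigantic case (or the callers could use the new helper directly), so the
"gigantic" name no longer shows up on the non-gigantic poisoned path.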

> + free_contig_range(page_to_pfn(page), 1 << order);
> + } else {
> + __free_pages(page, huge_page_order(h));
> + }
> }
> }
>
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
> if (restore_reserve)
> h->resv_huge_pages++;
>
> + if (PageHugePoisoned(page)) {
> + spin_unlock(&hugetlb_lock);
> + return;
> + }
> +
> if (PageHugeTemporary(page)) {
> list_del(&page->lru);
> ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> hugetlb_cgroup_migrate(oldpage, newpage);
> set_page_owner_migrate_reason(newpage, reason);
>
> + if (reason == MR_MEMORY_FAILURE)
> + SetPageHugePoisoned(oldpage);
> +
> /*
> * transfer temporary state of the new huge page. This is
> * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>
> atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> + bool huge_flag)
> {
> - SetPageHWPoison(page);
> + if (set_flag)
> + SetPageHWPoison(page);
> +
> + if (huge_flag)
> + dissolve_free_huge_page(page);
> + else if (release)
> + put_page(page);
> +

Indentation seems to be broken here; you can run checkpatch.pl for the details.
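
Just for reference, after the whitespace fixup I would expect it to look
roughly like below. This is only a re-indented copy of the logic from your
patch, nothing else changed, and untested:

static void page_handle_poison(struct page *page, bool release, bool set_flag,
			       bool huge_flag)
{
	if (set_flag)
		SetPageHWPoison(page);

	/* Hugetlb pages are dissolved; other pages just get their last put_page */
	if (huge_flag)
		dissolve_free_huge_page(page);
	else if (release)
		put_page(page);

	page_ref_inc(page);
	num_poisoned_pages_inc();
}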

Thanks,
Naoya Horiguchi

> page_ref_inc(page);
> num_poisoned_pages_inc();
> }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
>
> static int soft_offline_huge_page(struct page *page)
> {
> - int ret;
> + int ret = -EBUSY;
> unsigned long pfn = page_to_pfn(page);
> struct page *hpage = compound_head(page);
> LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
> ret = -EIO;
> } else {
> /*
> - * We set PG_hwpoison only when the migration source hugepage
> - * was successfully dissolved, because otherwise hwpoisoned
> - * hugepage remains on free hugepage list, then userspace will
> - * find it as SIGBUS by allocation failure. That's not expected
> - * in soft-offlining.
> + * At this point the page cannot be in-use since we do not
> + * let the page to go back to hugetlb freelists.
> + * In that case we just need to dissolve it.
> + * page_handle_poison will take care of it.
> */
> - ret = dissolve_free_huge_page(page);
> - if (!ret) {
> - if (set_hwpoison_free_buddy_page(page))
> - num_poisoned_pages_inc();
> - else
> - ret = -EBUSY;
> - }
> + page_handle_poison(page, true, true, true);
> }
> return ret;
> }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
> * would need to fix isolation locking first.
> */
> if (ret == 1) {
> - put_page(page);
> pr_info("soft_offline: %#lx: invalidated\n", pfn);
> - SetPageHWPoison(page);
> - num_poisoned_pages_inc();
> + page_handle_poison(page, true, true, false);
> return 0;
> }
>
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> MIGRATE_SYNC, MR_MEMORY_FAILURE);
> - if (ret) {
> + if (!ret) {
> + page_handle_poison(page, true, true, false);
> + } else {
> if (!list_empty(&pagelist))
> putback_movable_pages(&pagelist);
>
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
> static int soft_offline_in_use_page(struct page *page)
> {
> int ret;
> - int mt;
> struct page *hpage = compound_head(page);
>
> if (!PageHuge(page) && PageTransHuge(hpage))
> if (try_to_split_thp_page(page, "soft offline") < 0)
> return -EBUSY;
>
> - /*
> - * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> - * to free list immediately (not via pcplist) when released after
> - * successful page migration. Otherwise we can't guarantee that the
> - * page is really free after put_page() returns, so
> - * set_hwpoison_free_buddy_page() highly likely fails.
> - */
> - mt = get_pageblock_migratetype(page);
> - set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> if (PageHuge(page))
> ret = soft_offline_huge_page(page);
> else
> ret = __soft_offline_page(page);
> - set_pageblock_migratetype(page, mt);
> return ret;
> }
>
> static int soft_offline_free_page(struct page *page)
> {
> - int rc = dissolve_free_huge_page(page);
> + int rc = -EBUSY;
>
> if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> - page_handle_poison(page);
> + page_handle_poison(page, false, true, false);
> rc = 0;
> }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
> * we want to retry.
> */
> if (rc == MIGRATEPAGE_SUCCESS) {
> - put_page(page);
> - if (reason == MR_MEMORY_FAILURE) {
> + if (reason != MR_MEMORY_FAILURE)
> /*
> - * Set PG_HWPoison on just freed page
> - * intentionally. Although it's rather weird,
> - * it's how HWPoison flag works at the moment.
> + * We handle poisoned pages in page_handle_poison.
> */
> - if (set_hwpoison_free_buddy_page(page))
> - num_poisoned_pages_inc();
> - }
> + put_page(page);
> } else {
> if (rc != -EAGAIN) {
> if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
> trace_mm_page_free(page, order);
>
> + if (unlikely(PageHWPoison(page)) && !order) {
> + /*
> + * Untie memcg state and reset page's owner
> + */
> + if (memcg_kmem_enabled() && PageKmemcg(page))
> + __memcg_kmem_uncharge_page(page, order);
> + reset_page_owner(page, order);
> + return false;
> + }
> +
> /*
> * Check tail pages before head page information is cleared to
> * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
> spin_unlock_irqrestore(&zone->lock, flags);
> return ret;
> }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> - struct zone *zone = page_zone(page);
> - unsigned long pfn = page_to_pfn(page);
> - unsigned long flags;
> - unsigned int order;
> - bool hwpoisoned = false;
> -
> - spin_lock_irqsave(&zone->lock, flags);
> - for (order = 0; order < MAX_ORDER; order++) {
> - struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> - if (PageBuddy(page_head) && page_order(page_head) >= order) {
> - if (!TestSetPageHWPoison(page))
> - hwpoisoned = true;
> - break;
> - }
> - }
> - spin_unlock_irqrestore(&zone->lock, flags);
> -
> - return hwpoisoned;
> -}
> #endif
> --
> 2.26.2
>