Re: [PATCH v2 4/4] mm/hwpoison: fix unpoison_memory()
From: Naoya Horiguchi
Date: Wed Oct 27 2021 - 07:59:06 EST
On Tue, Oct 26, 2021 at 09:00:37PM -0700, Yang Shi wrote:
> On Mon, Oct 25, 2021 at 4:16 PM Naoya Horiguchi <nao.horiguchi@xxxxxxxxx> wrote:
> >
> > From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> >
> > After recent soft-offline rework, error pages can be taken off from
> > buddy allocator, but the existing unpoison_memory() does not properly
> > undo the operation. Moreover, due to the recent change on
> > __get_hwpoison_page(), get_page_unless_zero() is hardly called for
> > hwpoisoned pages. So __get_hwpoison_page() highly likely returns zero
> > (meaning to fail to grab page refcount) and unpoison just clears
> > PG_hwpoison without releasing a refcount. That does not lead to a
> > critical issue like kernel panic, but unpoisoned pages never get back to
> > buddy (leaked permanently), which is not good.
> >
> > To (partially) fix this, we need to identify "taken off" pages from
> > other types of hwpoisoned pages. We can't use refcount or page flags
> > for this purpose, so a pseudo flag is defined by hacking ->private
> > field. Someone might think that put_page() is enough to cancel
> > taken-off pages, but the normal free path contains some operations not
> > suitable for the current purpose, and can fire VM_BUG_ON().
> >
> > Note that unpoison_memory() is now supposed to be cancel hwpoison events
> > injected only by madvise() or /sys/devices/system/memory/{hard,soft}_offline_page,
> > not by MCE injection, so please don't try to use unpoison when testing
> > with MCE injection.
> >
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > ---
> > ChangeLog v2:
> > - unpoison_memory() returns as commented
> > - explicitly avoids unpoisoning slab pages
> > - separates internal pinning function into __get_unpoison_page()
> > ---
...
> > @@ -1262,6 +1278,27 @@ static int get_any_page(struct page *p, unsigned long flags)
> > return ret;
> > }
> >
> > +static int __get_unpoison_page(struct page *page)
> > +{
> > + struct page *head = compound_head(page);
> > + int ret = 0;
> > + bool hugetlb = false;
> > +
> > + ret = get_hwpoison_huge_page(head, &hugetlb);
> > + if (hugetlb)
> > + return ret;
> > +
> > + /*
> > + * PageHWPoisonTakenOff pages are not only marked as PG_hwpoison,
> > + * but also isolated from buddy freelist, so need to identify the
> > + * state and have to cancel both operations to unpoison.
> > + */
> > + if (PageHWPoisonTakenOff(head))
> > + return -EHWPOISON;
>
> I don't think we could see compound page here, but checking head page
> might be confusing since private is per subpage, so it might be better
> to use page instead of head.
OK, I'll do this (and pass page to get_page_unless_zero() too).
> > +
> > + return get_page_unless_zero(head) ? 1 : 0;
> > +}
> > +
> > /**
> > * get_hwpoison_page() - Get refcount for memory error handling
> > * @p: Raw error page (hit by memory error)
> > @@ -1278,18 +1315,26 @@ static int get_any_page(struct page *p, unsigned long flags)
> > * extra care for the error page's state (as done in __get_hwpoison_page()),
> > * and has some retry logic in get_any_page().
> > *
> > + * When called from unpoison_memory(), the caller should already ensure that
> > + * the given page has PG_hwpoison. So it's never reused for other page
> > + * allocations, and __get_unpoison_page() never races with them.
> > + *
> > * Return: 0 on failure,
> > * 1 on success for in-use pages in a well-defined state,
> > * -EIO for pages on which we can not handle memory errors,
> > * -EBUSY when get_hwpoison_page() has raced with page lifecycle
> > - * operations like allocation and free.
> > + * operations like allocation and free,
> > + * -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> > */
> > static int get_hwpoison_page(struct page *p, unsigned long flags)
> > {
> > int ret;
> >
> > zone_pcp_disable(page_zone(p));
> > - ret = get_any_page(p, flags);
> > + if (flags & MF_UNPOISON)
> > + ret = __get_unpoison_page(p);
> > + else
> > + ret = get_any_page(p, flags);
> > zone_pcp_enable(page_zone(p));
> >
> > return ret;
> > @@ -1942,6 +1987,26 @@ core_initcall(memory_failure_init);
> > pr_info(fmt, pfn); \
> > })
> >
> > +static inline int clear_page_hwpoison(struct ratelimit_state *rs, struct page *p)
> > +{
> > + if (TestClearPageHWPoison(p)) {
> > + unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
> > + page_to_pfn(p), rs);
> > + num_poisoned_pages_dec();
> > + return 0;
> > + }
> > + return -EBUSY;
>
> I don't quite get why -EBUSY is returned. TestClear returns the old
> value, so returning 0 means the flag was cleared before. It is fine,
> right? I don't see why we have to return different values.
I think clear_page_hwpoison()'s return value is used in the path where
get_hwpoison_page(MF_UNPOISON) returns 1. But as you mentioned -EBUSY might
not be a good return value. And I noticed that freeit's semantics is wrongly
reversed, that's just a bug, so I'll fix it.
>
> > +}
> > +
> > +static inline int unpoison_taken_off_page(struct ratelimit_state *rs,
> > + struct page *p)
> > +{
> > + if (take_page_back_buddy(p) && !clear_page_hwpoison(rs, p))
>
> If clear_page_hwpoison() is void, it can be moved into take_page_back_buddy().
>
> > + return 0;
> > + else
> > + return -EBUSY;
> > +}
> > +
> > /**
> > * unpoison_memory - Unpoison a previously poisoned page
> > * @pfn: Page number of the to be unpoisoned page
> > @@ -1958,9 +2023,7 @@ int unpoison_memory(unsigned long pfn)
> > {
> > struct page *page;
> > struct page *p;
> > - int freeit = 0;
> > - int ret = 0;
> > - unsigned long flags = 0;
> > + int ret = -EBUSY;
> > static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
> > DEFAULT_RATELIMIT_BURST);
> >
> > @@ -1996,24 +2059,27 @@ int unpoison_memory(unsigned long pfn)
> > goto unlock_mutex;
> > }
> >
> > - if (!get_hwpoison_page(p, flags)) {
> > - if (TestClearPageHWPoison(p))
> > - num_poisoned_pages_dec();
> > - unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
> > - pfn, &unpoison_rs);
> > + if (PageSlab(page))
> > goto unlock_mutex;
> > - }
> >
> > - if (TestClearPageHWPoison(page)) {
> > - unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
> > - pfn, &unpoison_rs);
> > - num_poisoned_pages_dec();
> > - freeit = 1;
> > - }
> > + ret = get_hwpoison_page(p, MF_UNPOISON);
> > + if (!ret) {
> > + ret = clear_page_hwpoison(&unpoison_rs, p);
> > + } else if (ret < 0) {
> > + if (ret == -EHWPOISON) {
> > + ret = unpoison_taken_off_page(&unpoison_rs, p);
> > + } else
> > + unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
> > + pfn, &unpoison_rs);
> > + } else {
> > + int freeit = clear_page_hwpoison(&unpoison_rs, p);
> >
> > - put_page(page);
> > - if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
> > put_page(page);
> > + if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1)) {
> > + put_page(page);
> > + ret = 0;
> > + }
> > + }
> >
> > unlock_mutex:
> > mutex_unlock(&mf_mutex);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4ea590646f89..b6e4cbb44c54 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -9466,6 +9466,7 @@ bool take_page_off_buddy(struct page *page)
> > del_page_from_free_list(page_head, zone, page_order);
> > break_down_buddy_pages(zone, page_head, page, 0,
> > page_order, migratetype);
> > + SetPageHWPoisonTakenOff(page);
> > if (!is_migrate_isolate(migratetype))
> > __mod_zone_freepage_state(zone, -1, migratetype);
> > ret = true;
> > @@ -9477,4 +9478,26 @@ bool take_page_off_buddy(struct page *page)
> > spin_unlock_irqrestore(&zone->lock, flags);
> > return ret;
> > }
> > +
> > +/*
> > + * Cancel takeoff done by take_page_off_buddy().
> > + */
> > +bool take_page_back_buddy(struct page *page)
>
> put_page_back_buddy() sounds more natural?
Thanks for the suggestion, sounds nice, so I'll rename it.
Thanks,
Naoya Horiguchi