Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages
From: Naoya Horiguchi
Date: Tue Jul 17 2018 - 21:29:03 EST
On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
> On 07/17/2018 07:27 AM, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> >> There's a race condition between soft offline and hugetlb_fault which
> >> causes unexpected process killing and/or hugetlb allocation failure.
> >>
> >> The process killing is caused by the following flow:
> >>
> >>   CPU 0               CPU 1              CPU 2
> >>
> >>   soft offline
> >>     get_any_page
> >>     // find the hugetlb is free
> >>                       mmap a hugetlb file
> >>                       page fault
> >>                         ...
> >>                           hugetlb_fault
> >>                             hugetlb_no_page
> >>                               alloc_huge_page
> >>                               // succeed
> >>       soft_offline_free_page
> >>       // set hwpoison flag
> >>                                          mmap the hugetlb file
> >>                                          page fault
> >>                                            ...
> >>                                              hugetlb_fault
> >>                                                hugetlb_no_page
> >>                                                  find_lock_page
> >>                                                    return VM_FAULT_HWPOISON
> >>                                            mm_fault_error
> >>                                              do_sigbus
> >>                                              // kill the process
> >>
> >>
> >> The hugetlb allocation failure comes from the following flow:
> >>
> >>   CPU 0                           CPU 1
> >>
> >>                                   mmap a hugetlb file
> >>                                   // reserve all free page but don't fault-in
> >>   soft offline
> >>     get_any_page
> >>     // find the hugetlb is free
> >>       soft_offline_free_page
> >>       // set hwpoison flag
> >>         dissolve_free_huge_page
> >>         // fail because all free hugepages are reserved
> >>                                   page fault
> >>                                     ...
> >>                                       hugetlb_fault
> >>                                         hugetlb_no_page
> >>                                           alloc_huge_page
> >>                                             ...
> >>                                               dequeue_huge_page_node_exact
> >>                                               // ignore hwpoisoned hugepage
> >>                                               // and finally fail due to no-mem
> >>
> >> The root cause of this is that the current soft-offline code is written
> >> on the assumption that the PageHWPoison flag should be set first to
> >> avoid accessing the corrupted data. This makes sense for memory_failure()
> >> or hard offline, but not for soft offline, because soft offline is
> >> about corrected (not uncorrected) errors and is safe from data loss.
> >> This patch changes the soft offline semantics so that it sets the
> >> PageHWPoison flag only after containment of the error page completes
> >> successfully.
> >
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something because the thing shouldn't
> > be really complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves) or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > the ordering of the PageHWPoison flag be relevant?
>
> My understanding may not be correct, but just looking at the current code
> for soft_offline_free_page helps me understand:
>
> static void soft_offline_free_page(struct page *page)
> {
> 	struct page *head = compound_head(page);
>
> 	if (!TestSetPageHWPoison(head)) {
> 		num_poisoned_pages_inc();
> 		if (PageHuge(head))
> 			dissolve_free_huge_page(page);
> 	}
> }
>
> The HWPoison flag is set before even checking to determine if the huge
> page can be dissolved. So, someone could attempt to pull the page
> off the free list (if free) or fault/map it (if already associated with
> a file) which leads to the failures described above. The patches ensure
> that we only set HWPoison after successfully dissolving the page. At least
> that is how I understand it.
Thanks for elaborating, this is correct.
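
To make the ordering concrete, here is a rough userspace model of the two
orderings. This is not kernel code and not the patch itself: struct fake_page,
fake_dissolve() and the soft_offline_old()/soft_offline_new() helpers are
made-up names for illustration, and the plain -1 return only stands for
"dissolve failed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a free hugepage; not the kernel's struct page. */
struct fake_page {
	bool hwpoison;	/* models PageHWPoison */
	bool reserved;	/* models "all free hugepages are reserved" */
	bool in_pool;	/* page still sits in the free hugepage pool */
};

/* Models dissolve_free_huge_page(): fails when the page is reserved. */
static int fake_dissolve(struct fake_page *p)
{
	assert(p != NULL);
	if (p->reserved)
		return -1;	/* cannot take the page out of the pool */
	p->in_pool = false;
	return 0;
}

/* Old ordering: set the flag first, then try to dissolve. */
static int soft_offline_old(struct fake_page *p)
{
	p->hwpoison = true;
	return fake_dissolve(p);
}

/* New ordering: set the flag only after containment succeeded. */
static int soft_offline_new(struct fake_page *p)
{
	int rc = fake_dissolve(p);

	if (!rc)
		p->hwpoison = true;
	return rc;
}
```

With a reserved page, soft_offline_old() leaves a poisoned page in the pool
(the allocation failure flow quoted above), while soft_offline_new() fails
without touching the page.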
>
> It seems that soft_offline_free_page can be called for in use pages.
> Certainly, that is the case in the first workflow above. With the
> suggested changes, I think this is OK for huge pages. However, it seems
> that setting HWPoison on an in-use non-huge page could cause issues?

Just after dissolve_free_huge_page() returns, the target page is just a
free buddy page without PageHWPoison set. If this page is allocated
immediately, that's the "migration succeeded, but soft offline failed" case,
so no problem.

There is certainly also a race between checking TestSetPageHWPoison and
page allocation; that issue is handled in patch 2/2.
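
As an illustration of that remaining race only (this is not the code in
patch 2/2), the sketch below models "allocate" and "poison" as two
transitions that both start from the free state, so exactly one of them can
win. All names are hypothetical, and a single C11 compare-and-swap stands in
for whatever serialization the real code uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page states for this model; not kernel definitions. */
enum fake_state { FAKE_FREE, FAKE_ALLOCATED, FAKE_POISONED };

struct fake_buddy_page {
	_Atomic int state;
};

/* Allocator side: take the page only if it is still free. */
static bool fake_alloc(struct fake_buddy_page *p)
{
	int expected = FAKE_FREE;

	return atomic_compare_exchange_strong(&p->state, &expected,
					      FAKE_ALLOCATED);
}

/* Soft-offline side: poison the page only if it is still free. */
static bool fake_poison_if_free(struct fake_buddy_page *p)
{
	int expected = FAKE_FREE;

	return atomic_compare_exchange_strong(&p->state, &expected,
					      FAKE_POISONED);
}
```

If fake_alloc() wins, fake_poison_if_free() returns false and the flag is
never set on an in-use page; if the poison side wins, the allocator skips
the page.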
> While looking at the code, I noticed this comment in __get_any_page()
> 	/*
> 	 * When the target page is a free hugepage, just remove it
> 	 * from free hugepage list.
> 	 */
> Did that apply to some code that was removed? It does not seem to make
> any sense in that routine.

This comment is completely obsolete; I'll remove it.

Thanks,
Naoya Horiguchi