Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages
From: Naoya Horiguchi
Date: Thu Jul 19 2018 - 04:10:17 EST
On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > > > There's a race condition between soft offline and hugetlb_fault which
> > > > > > causes unexpected process killing and/or hugetlb allocation failure.
> > > > > >
> > > > > > The process killing is caused by the following flow:
> > > > > >
> > > > > > CPU 0 CPU 1 CPU 2
> > > > > >
> > > > > > soft offline
> > > > > > get_any_page
> > > > > > // find the hugetlb is free
> > > > > > mmap a hugetlb file
> > > > > > page fault
> > > > > > ...
> > > > > > hugetlb_fault
> > > > > > hugetlb_no_page
> > > > > > alloc_huge_page
> > > > > > // succeed
> > > > > > soft_offline_free_page
> > > > > > // set hwpoison flag
> > > > > > mmap the hugetlb file
> > > > > > page fault
> > > > > > ...
> > > > > > hugetlb_fault
> > > > > > hugetlb_no_page
> > > > > > find_lock_page
> > > > > > return VM_FAULT_HWPOISON
> > > > > > mm_fault_error
> > > > > > do_sigbus
> > > > > > // kill the process
> > > > > >
> > > > > >
> > > > > > The hugetlb allocation failure comes from the following flow:
> > > > > >
> > > > > > CPU 0 CPU 1
> > > > > >
> > > > > > mmap a hugetlb file
> > > > > > // reserve all free page but don't fault-in
> > > > > > soft offline
> > > > > > get_any_page
> > > > > > // find the hugetlb is free
> > > > > > soft_offline_free_page
> > > > > > // set hwpoison flag
> > > > > > dissolve_free_huge_page
> > > > > > // fail because all free hugepages are reserved
> > > > > > page fault
> > > > > > ...
> > > > > > hugetlb_fault
> > > > > > hugetlb_no_page
> > > > > > alloc_huge_page
> > > > > > ...
> > > > > > dequeue_huge_page_node_exact
> > > > > > // ignore hwpoisoned hugepage
> > > > > > // and finally fail due to no-mem
> > > > > >
> > > > > > The root cause of this is that current soft-offline code is written
> > > > > > based on an assumption that PageHWPoison flag should beset at first to
> > > > > > avoid accessing the corrupted data. This makes sense for memory_failure()
> > > > > > or hard offline, but does not for soft offline because soft offline is
> > > > > > about corrected (not uncorrected) error and is safe from data lost.
> > > > > > This patch changes soft offline semantics where it sets PageHWPoison flag
> > > > > > only after containment of the error page completes successfully.
> > > > >
> > > > > Could you please expand on the worklow here please? The code is really
> > > > > hard to grasp. I must be missing something because the thing shouldn't
> > > > > be really complicated. Either the page is in the free pool and you just
> > > > > remove it from the allocator (with hugetlb asking for a new hugeltb page
> > > > > to guaratee reserves) or it is used and you just migrate the content to
> > > > > a new page (again with the hugetlb reserves consideration). Why should
> > > > > PageHWPoison flag ordering make any relevance?
> > > >
> > > > (Considering soft offlining free hugepage,)
> > > > PageHWPoison is set at first before this patch, which is racy with
> > > > hugetlb fault code because it's not protected by hugetlb_lock.
> > > >
> > > > Originally this was written in the similar manner as hard-offline, where
> > > > the race is accepted and a PageHWPoison flag is set as soon as possible.
> > > > But actually that's found not necessary/correct because soft offline is
> > > > supposed to be less aggressive and failure is OK.
> > >
> > > OK
> > >
> > > > So this patch is suggesting to make soft-offline less aggressive by
> > > > moving SetPageHWPoison into the lock.
> > >
> > > I guess I still do not understand why we should even care about the
> > > ordering of the HWPoison flag setting. Why cannot we simply have the
> > > following code flow? Or maybe we are doing that already I just do not
> > > follow the code
> > >
> > > soft_offline
> > > check page_count
> > > - free - normal page - remove from the allocator
> > > - hugetlb - allocate a new hugetlb page && remove from the pool
> > > - used - migrate to a new page && never release the old one
> > >
> > > Why do we even need HWPoison flag here? Everything can be completely
> > > transparent to the application. It shouldn't fail from what I
> > > understood.
> >
> > PageHWPoison flag is used to the 'remove from the allocator' part
> > which is like below:
> >
> > static inline
> > struct page *rmqueue(
> > ...
> > do {
> > page = NULL;
> > if (alloc_flags & ALLOC_HARDER) {
> > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > if (page)
> > trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > }
> > if (!page)
> > page = __rmqueue(zone, order, migratetype);
> > } while (page && check_new_pages(page, order));
> >
> > check_new_pages() returns true if the page taken from free list has
> > a hwpoison page so that the allocator iterates another round to get
> > another page.
> >
> > There's no function that can be called from outside allocator to remove
> > a page in allocator. So actual page removal is done at allocation time,
> > not at error handling time. That's the reason why we need PageHWPoison.
>
> hwpoison is an internal mm functionality so why cannot we simply add a
> function that would do that?
That's one possible solution.
I know about another downside in current implementation.
If a hwpoison page is found during high order page allocation,
all 2^order pages (not only hwpoison page) are removed from
buddy because of the above quoted code. And these leaked pages
are never returned to freelist even with unpoison_memory().
If we have a page removal function which properly splits high order
free pages into lower order pages, this problem is avoided.
OTOH PageHWPoison still has a role to report error to userspace.
Without it unpoison_memory() doesn't work.
Thanks,
Naoya Horiguchi
> I find the PageHWPoison usage here doing
> more complications than real good. Or am I missing something?