Re: [PATCH] mm, thp: relax migration wait when failed to get tail page

From: Hugh Dickins
Date: Wed Jun 02 2021 - 11:57:46 EST


On Wed, 2 Jun 2021, Yu Xu wrote:
> On 6/2/21 12:55 AM, Hugh Dickins wrote:
> > On Wed, 2 Jun 2021, Xu Yu wrote:
> >
> > > We notice that hung task happens in a conner but practical scenario when
> > > CONFIG_PREEMPT_NONE is enabled, as follows.
> > >
> > > Process 0 Process 1 Process
> > > 2..Inf
> > > split_huge_page_to_list
> > > unmap_page
> > > split_huge_pmd_address
> > > __migration_entry_wait(head)
> > > __migration_entry_wait(tail)
> > > remap_page (roll back)
> > > remove_migration_ptes
> > > rmap_walk_anon
> > > cond_resched
> > >
> > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > > copy_to_user, which will immediately fault again without rescheduling,
> > > and thus occupy the cpu fully.
> > >
> > > When there are too many processes performing __migration_entry_wait on
> > > tail page, remap_page will never be done after cond_resched.
> > >
> > > This relaxes __migration_entry_wait on tail page, thus gives remap_page
> > > a chance to complete.
> > >
> > > Signed-off-by: Gang Deng <gavin.dg@xxxxxxxxxxxxxxxxx>
> > > Signed-off-by: Xu Yu <xuyu@xxxxxxxxxxxxxxxxx>
> >
> > Well caught: you're absolutely right that there's a bug there.
> > But isn't cond_resched() just papering over the real bug, and
> > what it should do is a "page = compound_head(page);" before the
> > get_page_unless_zero()? How does that work out in your testing?
>
> compound_head works. The patched kernel is alive for hours under
> our reproducer, which usually makes the vanilla kernel hung after
> tens of minutes at most.

Oh, that's good news, thanks.

(It's still likely that a well-placed cond_resched() somewhere in
mm/gup.c would also be a good idea, but none of us have yet got
around to identifying where.)

>
> If we use compound_head, the behavior of __migration_entry_wait(tail)
> changes from "retry fault" to "prevent THP from being split". Is that
> right? Then which is preferred? If it were me, I would prefer "retry
> fault".

As Matthew remarked, you are asking very good questions, and split
migration entries are difficult to think about. But I believe you'll
find it works out okay.

The point of *put_and_* wait_on_page_locked() is that it does drop
the page reference you acquired with get_page_unless_zero, as soon
as the page is on the wait queue, before actually waiting.

So splitting the THP is only prevented for a brief interval. Now,
it's true that if there are very many tasks faulting on portions
of the huge page, in that interval between inserting the migration
entries and freezing the huge page's refcount to 0, they can reduce
the chance of splitting considerably. But that's not an excuse for
for doing get_page_unless_zero() on the wrong thing, as it was doing.

Hugh