On Wed, 2 Jun 2021, Yu Xu wrote:
On 6/2/21 12:55 AM, Hugh Dickins wrote:
On Wed, 2 Jun 2021, Xu Yu wrote:
We notice that hung task happens in a conner but practical scenario when
CONFIG_PREEMPT_NONE is enabled, as follows.
Process 0 Process 1 Process
2..Inf
split_huge_page_to_list
unmap_page
split_huge_pmd_address
__migration_entry_wait(head)
__migration_entry_wait(tail)
remap_page (roll back)
remove_migration_ptes
rmap_walk_anon
cond_resched
Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
copy_to_user, which will immediately fault again without rescheduling,
and thus occupy the cpu fully.
When there are too many processes performing __migration_entry_wait on
tail page, remap_page will never be done after cond_resched.
This relaxes __migration_entry_wait on tail page, thus gives remap_page
a chance to complete.
Signed-off-by: Gang Deng <gavin.dg@xxxxxxxxxxxxxxxxx>
Signed-off-by: Xu Yu <xuyu@xxxxxxxxxxxxxxxxx>
Well caught: you're absolutely right that there's a bug there.
But isn't cond_resched() just papering over the real bug, and
what it should do is a "page = compound_head(page);" before the
get_page_unless_zero()? How does that work out in your testing?
compound_head works. The patched kernel is alive for hours under
our reproducer, which usually makes the vanilla kernel hung after
tens of minutes at most.
Oh, that's good news, thanks.
(It's still likely that a well-placed cond_resched() somewhere in
mm/gup.c would also be a good idea, but none of us have yet got
around to identifying where.)
If we use compound_head, the behavior of __migration_entry_wait(tail)
changes from "retry fault" to "prevent THP from being split". Is that
right? Then which is preferred? If it were me, I would prefer "retry
fault".
As Matthew remarked, you are asking very good questions, and split
migration entries are difficult to think about. But I believe you'll
find it works out okay.
The point of *put_and_* wait_on_page_locked() is that it does drop
the page reference you acquired with get_page_unless_zero, as soon
as the page is on the wait queue, before actually waiting.
So splitting the THP is only prevented for a brief interval. Now,
it's true that if there are very many tasks faulting on portions
of the huge page, in that interval between inserting the migration
entries and freezing the huge page's refcount to 0, they can reduce
the chance of splitting considerably. But that's not an excuse for
for doing get_page_unless_zero() on the wrong thing, as it was doing.
Hugh