Re: Infinite looping observed in __offline_pages

From: John Allen
Date: Fri Jul 27 2018 - 13:33:09 EST


On Wed, Jul 25, 2018 at 10:03:36PM +0200, Michal Hocko wrote:
> On Wed 25-07-18 13:11:15, John Allen wrote:
> [...]
> > Does a failure in do_migrate_range indicate that the range is unmigratable
> > and the loop in __offline_pages should terminate and goto failed_removal? Or
> > should we allow a certain number of retries before we
> > give up on migrating the range?
>
> Unfortunately not. Migration code doesn't tell a difference between
> ephemeral and permanent failures. We are relying on
> start_isolate_page_range to tell us this. So the question is, what kind
> of page is not migratable and for what reason.
>
> Are you able to add some debugging to give us more information? The
> current debugging code in the hotplug/migration sucks...

After reproducing the problem a couple of times, it seems that it can occur for different types of pages. Running page-types on the offending page range in two separate reproductions produced the following:

# tools/vm/page-types -a 307968-308224
             flags  page-count  MB  symbolic-flags                               long-symbolic-flags
0x0000000000000400           1   0  __________B________________________________  buddy
             total           1   0

And the following on a separate run:

# tools/vm/page-types -a 313088-313344
             flags  page-count  MB  symbolic-flags                               long-symbolic-flags
0x000000000000006c           1   0  __RU_lA____________________________________  referenced,uptodate,lru,active
             total           1   0
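
For anyone reading the raw flag words above, here is a minimal sketch of how they decode. It assumes the low bit positions of /proc/kpageflags as documented in Documentation/admin-guide/mm/pagemap.rst (bits 0 through 10 only). The helper name decode_kpageflags is made up for illustration and is not part of page-types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Names for bits 0..10 of a /proc/kpageflags word, per the kernel's
 * pagemap documentation. Higher bits are deliberately left out here.
 */
static const char *const kpf_names[] = {
	"locked", "error", "referenced", "uptodate", "dirty", "lru",
	"active", "slab", "writeback", "reclaim", "buddy",
};

/*
 * Hypothetical helper (not from page-types): write the comma-separated
 * names of the set bits into buf and return buf.
 */
static char *decode_kpageflags(uint64_t flags, char *buf, size_t len)
{
	size_t n = sizeof(kpf_names) / sizeof(kpf_names[0]);

	assert(len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		if (!(flags & (UINT64_C(1) << i)))
			continue;
		/* Separate names with a comma, never overrunning buf. */
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, kpf_names[i], len - strlen(buf) - 1);
	}
	return buf;
}
```

With this, 0x400 is bit 10 alone (buddy), and 0x6c is bits 2, 3, 5 and 6 (referenced, uptodate, lru, active), which matches the long-symbolic-flags column in both runs.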

The failure in migrate_pages doesn't actually seem to be the permanent failure case, but rather the -EAGAIN case. I traced the -EAGAIN return back to migrate_page_move_mapping, which I've seen return -EAGAIN in two places:

mm/migrate.c:453
	if (!mapping) {
		/* Anonymous page without mapping */
		if (page_count(page) != expected_count)
			return -EAGAIN;

mm/migrate.c:476
	if (page_count(page) != expected_count ||
		radix_tree_deref_slot_protected(pslot,
					&mapping->i_pages.xa_lock) != page) {
		xa_unlock_irq(&mapping->i_pages);
		return -EAGAIN;
	}

So it seems that in each case, the page's actual reference count does not match the count that migrate_page_move_mapping expects.
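
To make the reference count check concrete, here is a toy userspace model of it. This is a sketch under assumptions, not kernel code: struct toy_page and both functions are invented for illustration, and the expected count is simplified to one reference held by the migration caller, plus one for the mapping and one for private data on a mapped base page. The retry helper only illustrates the bounded-retry question raised earlier in the thread; it is not a proposed fix.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct page; NOT the kernel structure. */
struct toy_page {
	int refcount;     /* what page_count() would report */
	int has_mapping;  /* 0 models an anonymous page without mapping */
	int has_private;  /* models page_has_private() */
};

/*
 * Simplified model of the check: the move may proceed only if the
 * references we can account for are the only ones on the page.
 */
static int toy_move_mapping(const struct toy_page *page, int extra_count)
{
	int expected_count = 1 + extra_count;

	if (page->has_mapping)
		expected_count += 1 + page->has_private;

	/* Any reference we cannot account for means "try again later". */
	if (page->refcount != expected_count)
		return -EAGAIN;
	return 0;
}

/*
 * Hypothetical bounded retry: give up after max_retries extra attempts
 * instead of looping for as long as -EAGAIN comes back.
 */
static int toy_migrate_with_retries(const struct toy_page *page,
				    int max_retries)
{
	int ret = -EAGAIN;

	assert(max_retries >= 0);
	for (int attempt = 0; attempt <= max_retries && ret == -EAGAIN; attempt++)
		ret = toy_move_mapping(page, 0);
	return ret;
}
```

In this model an anonymous page with a refcount of 1 moves, while the same page with one extra, unaccounted reference returns -EAGAIN on every attempt. If that extra reference never goes away, an unbounded caller loops forever, which is consistent with what I'm seeing.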