Re: Infinite looping observed in __offline_pages

From: Rashmica
Date: Tue Jul 31 2018 - 21:37:14 EST




On 26/07/18 04:11, John Allen wrote:
> Hi All,
>
> Under heavy stress and constant memory hot add/remove, I have observed
> the following loop to occasionally loop infinitely:
>
> mm/memory_hotplug.c:__offline_pages
>
> repeat:
>         /* start memory hot removal */
>         ret = -EINTR;
>         if (signal_pending(current))
>                 goto failed_removal;
>
>         cond_resched();
>         lru_add_drain_all();
>         drain_all_pages(zone);
>
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
>                 ret = do_migrate_range(pfn, end_pfn);
>                 goto repeat;
>         }
>

What is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE set to for you?

I have also observed this when hot removing and adding memory. However, I
have only seen it when my kernel has
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n (when memory is onlined
automatically I do not have this issue), so I assumed that I wasn't
onlining the memory properly...
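
For reference, with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n a hot-added
memory block stays offline until userspace writes "online" to its sysfs
state file. Below is a minimal sketch of that step in C. The helper names
memblock_state_path() and memblock_online() are made up for illustration,
and the sketch assumes the usual /sys/devices/system/memory/memoryN/state
layout; it is not necessarily what either of our setups does.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the sysfs state path for memory block `block`.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int memblock_state_path(char *buf, size_t len, unsigned int block)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/memory/memory%u/state", block);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: ask the kernel to online one memory block by
 * writing "online" to its state file. Needs root. Returns 0 on success.
 */
static int memblock_online(unsigned int block)
{
	char path[128];
	FILE *f;
	int ret = 0;

	if (memblock_state_path(path, sizeof(path), block))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs("online", f) == EOF)
		ret = -1;
	/* The write may only fail at flush time, so check fclose() too. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

Calling memblock_online(N) for each newly added block N would be the
manual equivalent of the automatic onlining; reading the same file back
should then report the block as online.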

> What appears to be happening in this case is that do_migrate_range
> returns a failure code which is being ignored. The failure is stemming
> from migrate_pages returning "1" which I'm guessing is the result of
> us hitting the following case:
>
> mm/migrate.c: migrate_pages
>
>         default:
>                 /*
>                  * Permanent failure (-EBUSY, -ENOSYS, etc.):
>                  * unlike -EAGAIN case, the failed page is
>                  * removed from migration page list and not
>                  * retried in the next outer loop.
>                  */
>                 nr_failed++;
>                 break;
>         }
>
> Does a failure in do_migrate_range indicate that the range is
> unmigratable and the loop in __offline_pages should terminate and goto
> failed_removal? Or should we allow a certain number of retries before we
> give up on migrating the range?
>
> This issue was observed on a ppc64le lpar on a 4.18-rc6 kernel.
>
> -John
>
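
To make the retry question concrete, here is a rough userspace model of
the loop with a bounded retry count. It is not a patch against
mm/memory_hotplug.c: MAX_MIGRATE_RETRIES, struct range_ops, offline_range()
and the demo callbacks are all invented for illustration, and the
callbacks merely stand in for scan_movable_pages() and do_migrate_range().
Whether a fixed limit is the right policy is exactly the open question.

```c
#include <errno.h>

#define MAX_MIGRATE_RETRIES 5	/* hypothetical limit, not a kernel constant */

/* Stand-ins for scan_movable_pages() and do_migrate_range(). */
struct range_ops {
	unsigned long (*scan)(unsigned long start, unsigned long end);
	int (*migrate)(unsigned long start, unsigned long end);
};

/*
 * Returns 0 once the scan finds no more movable pages, or -EBUSY if
 * migration fails more than MAX_MIGRATE_RETRIES times in a row.
 */
static int offline_range(const struct range_ops *ops,
			 unsigned long start_pfn, unsigned long end_pfn)
{
	int failures = 0;
	unsigned long pfn;

	while ((pfn = ops->scan(start_pfn, end_pfn)) != 0) {
		if (ops->migrate(pfn, end_pfn) == 0) {
			failures = 0;	/* progress made, reset the budget */
			continue;
		}
		/* Unlike the quoted loop, the return value is not ignored. */
		if (++failures > MAX_MIGRATE_RETRIES)
			return -EBUSY;	/* give up instead of looping forever */
	}
	return 0;
}

/* Demo callback: a range whose first page never goes away. */
static unsigned long stuck_scan(unsigned long start, unsigned long end)
{
	(void)end;
	return start;
}

/* Demo callback: a range with no movable pages left. */
static unsigned long empty_scan(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return 0;
}

/* Demo callback: mimics migrate_pages() reporting one failed page. */
static int failing_migrate(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return 1;
}
```

With stuck_scan and failing_migrate plugged in, offline_range() returns
-EBUSY after the retry budget is used up, where the loop quoted above
would spin until a signal arrives.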