Re: [PATCH v2] mm, vmscan: prevent useless kswapd loops

From: Mel Gorman
Date: Wed Jul 03 2019 - 04:38:13 EST

On Mon, Jul 01, 2019 at 01:18:47PM -0700, Shakeel Butt wrote:
> On production we have noticed hard lockups on large machines running
> large jobs due to kswapd hoarding the LRU lock within isolate_lru_pages()
> when sc->reclaim_idx is 0, which is a small zone. The LRU was a couple
> hundred GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in
> isolate_lru_pages() was basically skipping GiBs of pages while holding
> the LRU spinlock with interrupts disabled.
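
To make the skip condition quoted above concrete, here is a minimal
user-space sketch of the scan. It is an illustration only, not the
kernel's isolate_lru_pages(): every name prefixed with model_ is
hypothetical, the LRU is modelled as a plain array, and locking and
the handling of skipped pages are left out entirely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page on the LRU: only its zone number. */
struct model_page {
	int zonenum;
};

/* Hypothetical result type: pages taken versus pages walked past. */
struct model_scan_result {
	size_t taken;
	size_t skipped;
};

/*
 * Walk the modelled LRU until nr_to_take eligible pages are found.
 * A page is skipped when its zone is above reclaim_idx, mirroring the
 * quoted condition (page_zonenum(page) > sc->reclaim_idx). In the real
 * code this walk happens with the LRU lock held.
 */
struct model_scan_result model_isolate(const struct model_page *lru,
				       size_t nr_pages, int reclaim_idx,
				       size_t nr_to_take)
{
	struct model_scan_result res = { 0, 0 };
	size_t i;

	for (i = 0; i < nr_pages && res.taken < nr_to_take; i++) {
		if (lru[i].zonenum > reclaim_idx) {
			res.skipped++;	/* ineligible zone, keep walking */
			continue;
		}
		res.taken++;
	}
	return res;
}
```

If almost every entry belongs to a higher zone, a call with
reclaim_idx 0 walks nearly the whole array before it takes anything,
while the same call with the highest zone index stops after nr_to_take
entries.
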
>
> On further inspection, it seems like there are two issues:
>
> 1) If kswapd, on return from balance_pgdat(), could not sleep
> (i.e. the node is still unbalanced), the classzone_idx is unintentionally
> set to 0 and the whole reclaim cycle of kswapd will try to reclaim
> only the lowest and smallest zone while traversing the whole memory.
>
> 2) Fundamentally isolate_lru_pages() is really bad when the allocation
> has woken kswapd for a smaller zone on a very large machine running very
> large jobs. It can hoard the LRU spinlock while skipping over 100s of
> GiBs of pages.
>
> This patch only fixes (1). (2) needs a more fundamental solution.
> To fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
> invalid, use the classzone_idx of the previous kswapd loop; otherwise use
> the one the waker has requested.
>
> Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely
> due to mismatched classzone_idx")
> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
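
For anyone reading along without the diff, the selection rule described
for (1) can be sketched as below. This is a user-space model under
stated assumptions, not the patch itself: the model_ names are
hypothetical, and using a dedicated sentinel value to mean "invalid"
is an assumption of this sketch.

```c
/*
 * Hypothetical sentinel meaning "no waker has requested a zone index".
 * The value is arbitrary for this model; it only has to lie outside
 * the range of valid zone indexes.
 */
enum { MODEL_NO_REQUEST = 4 };

/* Hypothetical stand-in for the per-node state a waker updates. */
struct model_pgdat {
	int kswapd_classzone_idx;
};

/*
 * Pick the classzone index for the next kswapd loop:
 *  - if no valid request is pending, keep the index used by the
 *    previous loop instead of falling back to 0;
 *  - otherwise use the index the waker asked for.
 */
int model_pick_classzone_idx(const struct model_pgdat *pgdat,
			     int prev_classzone_idx)
{
	if (pgdat->kswapd_classzone_idx == MODEL_NO_REQUEST)
		return prev_classzone_idx;
	return pgdat->kswapd_classzone_idx;
}
```

The point of the fallback is that a node which is still unbalanced
keeps reclaiming for the zone range it was already working on, rather
than silently narrowing to zone 0.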

Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Mel Gorman