Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages

From: Johannes Weiner
Date: Mon Feb 27 2017 - 12:56:53 EST


On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
> On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
> [...]
> > From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Fri, 24 Feb 2017 10:56:32 -0500
> > Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
> >
> > Jia He reports a problem with kswapd spinning at 100% CPU when
> > requesting more hugepages than memory available in the system:
> >
> > $ echo 4000 >/proc/sys/vm/nr_hugepages
> >
> > top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
> > Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
> > KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
> >
> > At that time, there are no reclaimable pages left in the node, but as
> > kswapd fails to restore the high watermarks it refuses to go to sleep.
> >
> > Kswapd needs to back away from nodes that fail to balance. Up until
> > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > kswapd had such a mechanism. It considered zones whose theoretically
> > reclaimable pages it had reclaimed six times over as unreclaimable and
> > backed away from them. This guard was erroneously removed as the patch
> > changed the definition of a balanced node.
> >
> > However, simply restoring this code wouldn't help in the case reported
> > here: there *are* no reclaimable pages that could be scanned until the
> > threshold is met. Kswapd would stay awake anyway.
> >
> > Introduce a new and much simpler way of backing off. If kswapd runs
> > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > page, make it back off from the node. This is the same number of shots
> > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > that node until a direct reclaimer manages to reclaim some pages, thus
> > proving the node reclaimable again.
>
> Yes, this looks nice and simple. I would just be a bit worried about [1].
> Maybe that is worth a separate patch though.
>
> [1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@xxxxxxxxxxxxxx

I think I'd prefer the simplicity of keeping this contained inside
vmscan.c, as an interaction between direct reclaimers and kswapd, and
leaving the wakeup tied to actually seeing reclaimable pages rather
than merely producing free pages (should we also add a kick to a
large munmap(), for example?).

OOM kills come with such high latencies that I cannot imagine a
slightly quicker kswapd restart would matter in practice.
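
As a rough illustration of the back-off described in the changelog
above, here is a small user-space model in C. It is a sketch, not the
patch itself: struct node_model, its fields, and the helper functions
are hypothetical names, and reclaim is reduced to "take whatever is
reclaimable". Only MAX_RECLAIM_RETRIES and its value of 16 come from
the changelog.

```c
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* value given in the changelog */

/* Hypothetical stand-in for per-node state; not the kernel's pg_data_t. */
struct node_model {
	int kswapd_failures;		/* consecutive kswapd runs without progress */
	unsigned long reclaimable;	/* pages that reclaim could still free */
};

/* kswapd stays off a node once it has failed too many times in a row. */
bool kswapd_should_run(const struct node_model *n)
{
	return n->kswapd_failures < MAX_RECLAIM_RETRIES;
}

/* One kswapd run: count a failure if not a single page was reclaimed. */
unsigned long kswapd_run(struct node_model *n)
{
	unsigned long reclaimed = n->reclaimable;

	n->reclaimable = 0;
	if (reclaimed)
		n->kswapd_failures = 0;
	else
		n->kswapd_failures++;
	return reclaimed;
}

/* Direct reclaim: any progress proves the node reclaimable again. */
unsigned long direct_reclaim(struct node_model *n)
{
	unsigned long reclaimed = n->reclaimable;

	n->reclaimable = 0;
	if (reclaimed)
		n->kswapd_failures = 0;
	return reclaimed;
}

/* Run kswapd until it backs off; return how many runs that took. */
int runs_until_backoff(struct node_model *n)
{
	int runs = 0;

	while (kswapd_should_run(n)) {
		kswapd_run(n);
		runs++;
	}
	return runs;
}
```

In this model, a node with nothing reclaimable makes kswapd give up
after MAX_RECLAIM_RETRIES runs, and it stays off until direct_reclaim()
frees at least one page. Nothing outside the reclaim paths touches the
failure counter, which is the containment argued for above.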

> > Reported-by: Jia He <hejianet@xxxxxxxxx>
> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>

Thanks!

> I would have just one more suggestion. Please move MAX_RECLAIM_RETRIES
> to mm/internal.h. This is an MM-internal thing and there is no need to make
> it visible.

Good point, I'll move it.