On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
[...]
>From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Fri, 24 Feb 2017 10:56:32 -0500
Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:
$ echo 4000 >/proc/sys/vm/nr_hugepages
top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.
Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.
However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.
Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
Yes this looks, nice&simple. I would just be worried about [1] a bit.
Maybe that is worth a separate patch though.
[1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@xxxxxxxxxxxxxx
I think I'd prefer the simplicity of keeping this contained inside
vmscan.c, as an interaction between direct reclaimers and kswapd, as
well as leaving the wakeup tied to actually seeing reclaimable pages
rather than merely producing free pages (e.g. should we also add a
kick to a large munmap() for example?).
OOM kills come with such high latencies that I cannot imagine a
slightly quicker kswapd restart would matter in practice.
Reported-by: Jia He <hejianet@xxxxxxxxx>
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Thanks!
I would have just one more suggestion. Please move MAX_RECLAIM_RETRIES
to mm/internal.h. This is MM internal thing and there is no need to make
it visible.
Good point, I'll move it.