[RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing

From: Feng Tang
Date: Wed Nov 04 2020 - 01:19:54 EST


With the rise of the memory hotplug feature and persistent memory,
some platforms have memory nodes which contain only a movable zone.

Users may bind some of their workloads (like docker/containers) to
these nodes, and there have been many reports of OOMs and page
allocation failures. One call stack is:

[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
[ 1387.877570] Call Trace:
[ 1387.877579] dump_stack+0x6b/0x88
[ 1387.877584] dump_header+0x4a/0x1e2
[ 1387.877586] oom_kill_process.cold+0xb/0x10
[ 1387.877588] out_of_memory.part.0+0xaf/0x230
[ 1387.877591] out_of_memory+0x3d/0x80
[ 1387.877595] __alloc_pages_slowpath.constprop.0+0x954/0xa20
[ 1387.877599] __alloc_pages_nodemask+0x2d3/0x300
[ 1387.877602] pipe_write+0x322/0x590
[ 1387.877607] new_sync_write+0x196/0x1b0
[ 1387.877609] vfs_write+0x1c3/0x1f0
[ 1387.877611] ksys_write+0xa7/0xe0
[ 1387.877617] do_syscall_64+0x52/0xd0
[ 1387.877621] entry_SYSCALL_64_after_hwframe+0x44/0xa9

In a full container run, like installing and running the stress tool
"stress-ng", there are many different kinds of page requests (gfp_masks),
many of which only allow non-movable zones. Some of them can fall back
to other nodes with NORMAL/DMA32/DMA zones, but others are blocked by
the __GFP_HARDWALL or ALLOC_CPUSET check and trigger OOM killing. But
OOM killing does not help here, as the failure is not caused by a lack
of free memory, but simply by the node binding policy check.

So loosen the policy check for this case.

Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
---
mm/page_alloc.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d772206..efd49a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4669,6 +4669,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (!ac->preferred_zoneref->zone)
goto nopage;

+ /*
+ * If the task's allowed memory nodes only have movable zones, while
+ * the gfp_mask's highest allowed zone is below ZONE_MOVABLE, loosen
+ * the __GFP_HARDWALL and ALLOC_CPUSET checks; otherwise we could
+ * trigger OOM killing, which cannot fix a policy-check failure.
+ */
+ if (ac->highest_zoneidx <= ZONE_NORMAL) {
+ int nid;
+ unsigned long unmovable = 0;
+
+ /* FIXME: this could be a separate function */
+ for_each_node_mask(nid, cpuset_current_mems_allowed) {
+ unmovable += NODE_DATA(nid)->node_present_pages -
+ NODE_DATA(nid)->node_zones[ZONE_MOVABLE].present_pages;
+ }
+
+ if (!unmovable) {
+ gfp_mask &= ~(__GFP_HARDWALL);
+ alloc_flags &= ~ALLOC_CPUSET;
+ }
+ }
+
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);

--
2.7.4