[bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system

From: Jan Stancek
Date: Thu Oct 13 2016 - 08:28:18 EST


Hi,

I'm running into ENOMEM failures with the libhugetlbfs testsuite [1] on
a power8 lpar system running 4.8 or latest git [2]. Repeated runs of
this suite trigger multiple OOMs that eventually kill the entire system,
usually within 3-5 runs. The system configuration is:

* Total System Memory......: 18024 MB
* Shared Mem Max Mapping...: 320 MB
* System Huge Page Size....: 16 MB
* Available Huge Pages.....: 20
* Total size of Huge Pages.: 320 MB
* Remaining System Memory..: 17704 MB
* Huge Page User Group.....: hugepages (1001)

I see this only on ppc (BE/LE); x86_64 seems unaffected and ran the
tests successfully for ~12 hours.

Bisect has identified the following patch as the culprit:
commit 67961f9db8c477026ea20ce05761bde6f8bf85b0
Author: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Wed Jun 8 15:33:42 2016 -0700
mm/hugetlb: fix huge page reserve accounting for private mappings


The following patch (made with my limited insight), applied to
latest git [2], fixes the problem for me:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9e..7261583 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1876,7 +1876,7 @@ static long __vma_reservation_common(struct hstate *h,
 		 * return value of this routine is the opposite of the
 		 * value returned from reserve map manipulation routines above.
 		 */
-		if (ret)
+		if (ret >= 0)
 			return 0;
 		else
 			return 1;
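
To make the effect of the one-line change easier to see, here is a
minimal standalone sketch that isolates just the inner check from the
hunk above. The function names check_before(), check_after() and
compare() are hypothetical and do not exist in mm/hugetlb.c, and -12 is
only a sample negative value. Whether zero or negative values of ret can
actually reach this check depends on the surrounding code, which the
hunk does not show, so this only compares the two conditions in
isolation:

```c
#include <stdio.h>

/* Hypothetical helper: the inner check as it stands before the patch. */
long check_before(long ret)
{
	if (ret)
		return 0;
	else
		return 1;
}

/* Hypothetical helper: the inner check with the patch applied. */
long check_after(long ret)
{
	if (ret >= 0)
		return 0;
	else
		return 1;
}

/* Print both results for one sample value of ret. */
void compare(long ret)
{
	printf("ret=%ld before=%ld after=%ld\n",
	       ret, check_before(ret), check_after(ret));
}
```

Taken in isolation, the two versions agree only for positive ret (both
return 0). They differ for ret == 0 (1 before the patch, 0 after) and
for negative ret (0 before, 1 after).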

Regards,
Jan

[1] https://github.com/libhugetlbfs/libhugetlbfs
[2] v4.8-14230-gb67be92