Re: [PATCH 0/3] OOM detection rework v4

From: Michal Hocko
Date: Tue Mar 01 2016 - 08:38:55 EST


[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@xxxxxxxxxxxx]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > >
> > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > >
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > >
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation. j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> >
> > I have tried to reproduce and failed in a virtual on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
>
> Thanks a lot for giving it a go.
>
> I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big?? But probably that's just a misunderstanding
> or typo somewhere.

A typo. 445 was from a 900M test (445 x 2MB ~ 890M) which I was doing
while writing the email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -j20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM. I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time.

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
total used free shared buffers cached
Mem: 1008460 941904 66556 0 5092 806760
-/+ buffers/cache: 130052 878408
Swap: 2097148 42648 2054500
$ du -sh linux-3.7-rc5/
799M linux-3.7-rc5/

Sorry about that but this is what my other tests were using and I forgot
to check. Now let's try the same with the current linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M /mnt/tmpfs/linux-4.5-rc6/
$ free
total used free shared buffers cached
Mem: 1008460 962976 45484 0 7236 820064
-/+ buffers/cache: 135676 872784
Swap: 2097148 16 2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function 'acpi_data_prop_read':
drivers/acpi/property.c:745:8: warning: 'obj' may be used uninitialized in this function [-Wmaybe-uninitialized]

real 8m36.621s
user 14m1.642s
sys 2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space. Unless
> perhaps it's some superfast swap that you have?

The swap partition is a standard qcow image stored on my SSD, so the
IO should be quite fast. This smells like a potential contributor:
faster swap IO should make my reclaim quicker and therefore more
efficient (in the scanned/reclaimed sense).
I realize I might be boring already when blaming compaction but let me
try again ;)
$ grep compact /proc/vmstat
$ grep compact /proc/vmstat
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done the direct compaction only 128 times during
that test. This doesn't sound like much to me
$ grep allocstall /proc/vmstat
allocstall 1061

we entered the direct reclaim much more often, but most of the load will
be order-0 so this might still be ok. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC;

+ if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+ trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
switch (compact_result) {
case COMPACT_DEFERRED:
*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED, which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages
so I have interrupted that after one hour (there was still no OOM). The
trace log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat
compact_stall 190

so the compaction was still skipped more often than actually invoked for
!costly allocations:
$ sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
190 2 1
122 2 3
59 2 4

#define COMPACT_SKIPPED 1
#define COMPACT_PARTIAL 3
#define COMPACT_COMPLETE 4

that means that compaction was not even tried in half of the cases! This
doesn't sound right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
then we simply rely on the order-0 reclaim to automagically form higher
order blocks. This might indeed work when we retry many times, but I
guess it is not a good approach. It leads to excessive reclaim and the
stall for the allocation can be really large.

One of the suspicious places is __compaction_suitable which does order-0
watermark check (increased by 2<<order). I have put another trace_printk
there and it clearly pointed out this was the case.

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
alloc_flags))
return COMPACT_PARTIAL;

+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return COMPACT_CONTINUE;
+
/*
* Watermarks for order-0 must be met for compaction. Note the 2UL.
* This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real 8m46.626s
user 14m15.823s
sys 2m45.471s

the time increased but I haven't checked how stable the result is.

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

this is worse: we have scanned more pages for migration but the overall
success rate dropped (71/129 vs. 102/128 stalls before) and the direct
reclaim was invoked more often (allocstall 1665 vs. 1061). I do not have
a good theory for that and will play with this some more. Maybe other
changes are needed deeper in the compaction code.

I will play with this some more but I would be really interested to hear
whether this helped Hugh with his setup. Vlastimil, Joonsoo, does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM. Not an issue if you're
> booting VMs.

Ohh, I see.

--
Michal Hocko
SUSE Labs