Re: [PATCH 0/3] OOM detection rework v4

From: Hugh Dickins
Date: Thu Mar 03 2016 - 04:55:33 EST


On Tue, 1 Mar 2016, Michal Hocko wrote:
> [Adding Vlastimil and Joonsoo for compaction related things - this was a
> large thread but the more interesting part starts with
> http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@xxxxxxxxxxxx]
>
> On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > [...]
> > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > way to gobble up most of the memory, though it's not how I've done it).
> > > >
> > > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > make defconfig there, then make -j20.
> > > >
> > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > >
> > > > Except that you'll probably need to fiddle around with that j20,
> > > > it's true for my laptop but not for my workstation. j20 just happens
> > > > to be what I've had there for years, that I now see breaking down
> > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > but it still doesn't exercise swap very much).
> > >
> > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > will try with another host with more CPUs (because my laptop has only
> > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap

I've found that the number of CPUs makes quite a difference - I have 4.

And another difference between us may be in our configs: on this laptop
I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
the size of each shmem_inode (and those of course are not swappable).

I found that I could avoid the OOM if I ran the "make -j20" on a
kernel without all those debug options, and booted with nr_cpus=2.
And currently I'm booting the kernel with the debug options in,
but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).

Maybe in the OOM rework, threads are cancelling each other's progress
more destructively, where before they co-operated to some extent?

(All that is on the laptop. The G5 is still busy full-time bisecting
a powerpc issue: I know it was OOMing with the rework, but I have not
verified the effect of nr_cpus on it. My x86 workstation has not been
OOMing with the rework - I think that means that I've not been exerting
as much memory pressure on it as I'd thought, that it copes with the load
better, and would only show the difference if I loaded it more heavily.)

> > > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > > (16, 10 no difference really). I was also collecting vmstat in the
> > > background. The compilation takes ages but the behavior seems consistent
> > > and stable.
> >
> > Thanks a lot for giving it a go.
> >
> > I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
> > are less than 2M big?? But probably that's just a misunderstanding
> > or typo somewhere.
>
> A typo. 445 was from 900M test which I was doing while writing the
> email. Sorry about the confusion.

That makes more sense! Though I'm still amazed that you got anywhere,
taking so much of the usable memory out.

>
> > Ignoring that, you're successfully doing a make -20 defconfig build
> > in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
> > at all surprised that it takes ages, but I am very surprised that it
> > does not OOM. I suppose by rights it ought not to OOM, the built
> > tree occupies only a little more than 1G, so you do have enough swap;
> > but I wouldn't get anywhere near that myself without OOMing - I give
> > myself 1G of RAM (well, minus whatever the booted system takes up)
> > to do that build in, four times your RAM, yet in my case it OOMs.
> >
> > That source tree alone occupies more than 700M, so just copying it
> > into your tmpfs would take a long time.
>
> OK, I just found out that I was cheating a bit. I was building
> linux-3.7-rc5.tar.bz2 which is smaller:
> $ du -sh /mnt/tmpfs/linux-3.7-rc5/
> 537M /mnt/tmpfs/linux-3.7-rc5/

Right, I have a habit like that too; but my habitual testing still
uses the 2.6.24 source tree, which is rather too old to ask others
to reproduce with - but we both find that the kernel source tree
keeps growing, and prefer to stick with something of a fixed size.

>
> and after the defconfig build:
> $ free
> total used free shared buffers cached
> Mem: 1008460 941904 66556 0 5092 806760
> -/+ buffers/cache: 130052 878408
> Swap: 2097148 42648 2054500
> $ du -sh linux-3.7-rc5/
> 799M linux-3.7-rc5/
>
> Sorry about that but this is what my other tests were using and I forgot
> to check. Now let's try the same with the current linus tree:
> host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
> $ du -sh /mnt/tmpfs/linux-4.5-rc6/
> 707M /mnt/tmpfs/linux-4.5-rc6/
> $ free
> total used free shared buffers cached
> Mem: 1008460 962976 45484 0 7236 820064

I guess we have different versions of "free": mine shows Shmem as shared,
but yours appears to be an older version, just showing 0.

> -/+ buffers/cache: 135676 872784
> Swap: 2097148 16 2097132
> $ time make -j20 > /dev/null
> drivers/acpi/property.c: In function âacpi_data_prop_readâ:
> drivers/acpi/property.c:745:8: warning: âobjâ may be used uninitialized in this function [-Wmaybe-uninitialized]
>
> real 8m36.621s
> user 14m1.642s
> sys 2m45.238s
>
> so I wasn't cheating all that much...
>
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space. Unless
> > perhaps it's some superfast swap that you have?
>
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).
> I realize I might be boring already when blaming compaction but let me
> try again ;)
> $ grep compact /proc/vmstat
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
>
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
>
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> + if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> + trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
> switch (compact_result) {
> case COMPACT_DEFERRED:
> *deferred_compaction = true;
>
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
> gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
>
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
>
> $ grep compact_stall /proc/vmstat
> compact_stall 190
>
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> 190 2 1
> 122 2 3
> 59 2 4
>
> #define COMPACT_SKIPPED 1
> #define COMPACT_PARTIAL 3
> #define COMPACT_COMPLETE 4
>
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
>
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.
>
> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> alloc_flags))
> return COMPACT_PARTIAL;
>
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + return COMPACT_CONTINUE;
> +

I gave that a try just now, but it didn't help me: OOMed much sooner,
after doing half as much work. (FWIW, I have been including your other
patch, the "Andrew, could you queue this one as well, please" patch.)

I do agree that compaction appears to have closed down when we OOM:
taking that along with my nr_cpus remark (and the make -jNumber),
are parallel compactions interfering with each other destructively,
in a way that they did not before the rework?

> /*
> * Watermarks for order-0 must be met for compaction. Note the 2UL.
> * This is because during migration, copies of pages need to be
>
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
>
> real 8m46.626s
> user 14m15.823s
> sys 2m45.471s
>
> the time increased but I haven't checked how stable the result is.

But I didn't investigate its stability either, may have judged against
it too soon.

>
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1

I have not seen any compact_kcompatd_wakes at all:
perhaps we're too busy compacting directly.

(Vlastimil, there's a "c" missing from that name, it should be
"compact_kcompactd_wake" - though "compact_daemon_wake" might be nicer.)

>
> $ grep allocstall /proc/vmstat
> allocstall 1665
>
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.
>
> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?

It didn't help me; but I do suspect you're right to be worrying about
the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.

>
> > I was only suggesting to allocate hugetlb pages, if you preferred
> > not to reboot with artificially reduced RAM. Not an issue if you're
> > booting VMs.
>
> Ohh, I see.

I've attached vmstats.xz, output from your read_vmstat proggy;
together with oom.xz, the dmesg for the OOM in question.

I hacked out_of_memory() to count_vm_event(BALLOON_DEFLATE),
that being a count that's always 0 for me: so when you see
"balloon_deflate 1" towards the end, that's where the OOM
kill came in, and shortly after I Ctrl-C'ed.

I hope you can get more out of it than I have - thanks!

Hugh

Attachment: vmstats.xz
Description: application/xz

Attachment: oom.xz
Description: application/xz