Re: Please backport commit 3812c8c8f39 to stable

From: Michal Hocko
Date: Tue Oct 07 2014 - 08:33:39 EST


On Fri 03-10-14 11:16:31, Cong Wang wrote:
> On Fri, Oct 3, 2014 at 8:37 AM, Michal Hocko <mhocko@xxxxxxx> wrote:
> > On Thu 02-10-14 14:04:08, Cong Wang wrote:
> >> Hello again,
> >>
> >> I realized it is actually a series of patches:
> >>
> >> 3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
> >> chargers with full callstack on OOM
> >> fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
> >> document OOM waiting and wakeup
> >> 519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
> >> killer only for user faults
> >> 3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
> >> path with fatal signal
> >> 759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
> >> fault flag to generic fault handler
> >> 871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
> >> killer on kernel fault OOM
> >> 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
> >> init OOM protection
> >
> > Yes, that looks like the full series.
> >
> >> I am not sure if they have more dependencies.
> >>
> >> However, this bug is *fairly* easy to reproduce on 3.10, just using the
> >> following script:
> >>
> >> #!/bin/bash
> >>
> >> TEST_DIR=/tmp/cgroup_test
> >> [ -d $TEST_DIR ] || mkdir -p $TEST_DIR
> >> mount -t cgroup none $TEST_DIR -o memory
> >> mkdir $TEST_DIR/test
> >> echo 512k > $TEST_DIR/test/memory.limit_in_bytes
> >
> > This is just insane. You allow only 128 pages to be charged and the
> > reclaim will have to constantly wait for each page to finish the
> > writeback.
>
> This is a test case used ONLY to reproduce this bug, so why does it have
> to be sane? :)
>
> On the other hand, no matter how insane a test case is, as long as it
> triggers hung tasks in the kernel, it is a kernel bug that needs fixing.

Well, my point was that an insane setting might produce a lot of
problems. And as I said, this problem has been inherent since day 1.
So a real-world example would be much preferable, especially when
we have had this state for years and nobody has triggered it.
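
For reference, the 128-page figure quoted above follows from dividing the
512k limit by the page size. A minimal sketch of that arithmetic, assuming
4 KiB pages (the usual x86 default) and 1024-based k/m/g suffixes; the
helper name limit_to_pages is hypothetical and not part of any tool:

```shell
#!/bin/bash
# Hypothetical helper: convert a memcg limit string such as "512k"
# into the number of pages that can be charged under it.
# Assumption: k/m/g suffixes are powers of 1024.
limit_to_pages() {
    local limit=$1
    # Second argument overrides the page size; default to the running system's.
    local page_size=${2:-$(getconf PAGESIZE)}
    local num=${limit%[kKmMgG]}      # strip one trailing suffix letter, if any
    local suffix=${limit##*[0-9]}    # whatever follows the last digit
    local bytes
    case $suffix in
        k|K) bytes=$((num * 1024)) ;;
        m|M) bytes=$((num * 1024 * 1024)) ;;
        g|G) bytes=$((num * 1024 * 1024 * 1024)) ;;
        "")  bytes=$num ;;
        *)   echo "unknown suffix: $suffix" >&2; return 1 ;;
    esac
    echo $((bytes / page_size))
}

# The limit from the test case, with an explicit 4 KiB page size.
limit_to_pages 512k 4096
```

On a machine with a larger page size the same limit allows even fewer
pages, so pass the page size explicitly when comparing numbers.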

[...]
> >> So please consider this seriously. :)
> >
> > The bug has been there since the memory controller was introduced. Yet we
> > have only had a single report from real life. So I do not think
> > this is that urgent. It was definitely not a good design decision that
> > the OOM killer was handled on top of unknown locks which might prevent
> > forward progress. No question about that. Do you see the problem in
> > real life somewhere? Because, to be honest, the test case is pretty much
> > insane.
>
> I am sorry for giving the impression that the above test case is what caused
> this bug. No, we saw this bug in *production* in our data center; it happened
> on 30+ machines!! :) The above insane test case is ONLY meant to draw your
> attention to how serious the bug is, nothing else.

Sure, then the issue definitely needs to be fixed.

You wrote in another email that you have a backport. I will help
you with the review if you post it publicly.

--
Michal Hocko
SUSE Labs