Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Sun Jan 29 2017 - 17:51:09 EST


On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know
> > if I'm doing something wrong, see below.)
>
> My fault. I should have noted that you should use since-4.9 branch.

OK, I have good news. I compiled your mhocko git tree (properly this
tim!) using since-4.9 branch (last commit
ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
3am's, over 60 hours, and I made sure all the usual oom culprits ran,
and I ran extras (finds on the whole tree, extra rdiff-backups) to try
to tax it. Based on my previous criteria I would say your since-4.9 as
of the above commit solves my bug, at least over a 3 day test span
(which it never survives when the bug is present)!

I tested WITHOUT any cgroup/mem boot options. I do still have my
mem=6G limiter on, though (I've never tested with it off, until I solve
the bug with it on, since I've had it on for many months for other
reasons).

On 2017-01-27 Michal Hocko wrote:
> OK, that matches the theory that these OOMs are caused by the
> incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> the active list aging for lowmem requests when memcg is enabled")

b4536f0c829c isn't in the since-4.9 I tested above though? So
something else you did must have fixed it (also)? I don't think I've
run any tests yet with b4536f0c829c in them? I think the vanillas I
was doing a couple of weeks ago were before b4536f0c829c, but I can't
be sure.

What do I test next? Does the since-4.9 stuff get pushed into vanilla
(4.9 hopefully?) so it can find its way into Fedora's stuck F24
kernel?

I want to also note that the RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
interest as more people start me-too'ing. The situation is almost
always the same: large rsync's or similar tree-scan accesses cause oom
on PAE boxes. However, I wanted to note that many people there reported
that cgroup_disable=memory doesn't fix anything for them, whereas that
always makes the problem go away on my boxes. Strange.

Thanks Michal and Mel, I really appreciate it!