Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Michal Hocko
Date: Mon Jan 30 2017 - 02:52:51 EST


On Sun 29-01-17 16:50:03, Trevor Cordes wrote:
> On 2017-01-25 Michal Hocko wrote:
> > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > > OK, I patched & compiled mhocko's git tree from the other day
> > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > > couple of weeks ago shows the newest commit (git log) is
> > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know
> > > if I'm doing something wrong, see below.)
> >
> > My fault. I should have noted that you should use since-4.9 branch.
>
> OK, I have good news. I compiled your mhocko git tree (properly this
> time!) using since-4.9 branch (last commit
> ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
> 3am's, over 60 hours, and I made sure all the usual oom culprits ran,
> and I ran extras (finds on the whole tree, extra rdiff-backups) to try
> to tax it. Based on my previous criteria I would say your since-4.9 as
> of the above commit solves my bug, at least over a 3 day test span
> (which it never survives when the bug is present)!
>
> I tested WITHOUT any cgroup/mem boot options. I do still have my
> mem=6G limiter on, though (I've never tested with it off, until I solve
> the bug with it on, since I've had it on for many months for other
> reasons).

Good news indeed.

>
> On 2017-01-27 Michal Hocko wrote:
> > OK, that matches the theory that these OOMs are caused by the
> > incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> > the active list aging for lowmem requests when memcg is enabled")
>
> b4536f0c829c isn't in the since-4.9 I tested above though?

Yes, this is a sha1 from Linus' tree. The same commit is in the since-4.9
branch under 0759e73ee689f2066a4d64dd90ec5cc3fed28f86. There are some
more fixes on top, of course.
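
Since the sha1 differs between trees, one way to check this locally is
to look the commit up by its subject line instead. A minimal sketch,
assuming a clone that has the since-4.9 branch checked out or fetched;
the helper name find_by_subject is made up for illustration:

```shell
# Hypothetical helper: list commits reachable from REV whose commit
# message contains SUBJECT as a literal string.
# Prints one "<sha1> <subject>" line per match.
find_by_subject() {
    rev=$1
    subject=$2
    git log --format='%H %s' --fixed-strings --grep="$subject" "$rev"
}

# Example (the branch name is assumed to exist in your clone):
# find_by_subject since-4.9 "mm, memcg: fix the active list aging for lowmem requests"
```

If the fix is present, this should print the branch-local sha1 next to
the subject; empty output would suggest the commit is not reachable from
that branch.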

> So
> something else you did must have fixed it (also)? I don't think I've
> run any tests yet with b4536f0c829c in them? I think the vanillas I
> was doing a couple of weeks ago were before b4536f0c829c, but I can't
> be sure.
>
> What do I test next? Does the since-4.9 stuff get pushed into vanilla
> (4.9 hopefully?) so it can find its way into Fedora's stuck F24
> kernel?

Testing with vanilla rc6, released just yesterday, would be a good fit.
There are some more fixes sitting in mmotm on top and maybe we want some
of them in the final 4.10. Anyway, all those pending changes should be
merged in the next merge window - aka 4.11.

> I want to also note that the RHBZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
> interest as more people start me-too'ing. The situation is almost
> always the same: large rsync's or similar tree-scan accesses cause oom
> on PAE boxes.

I believe your instructions in comment 20 cover it nicely. If the
problem still persists with the current mmotm tree I would suggest
writing to the mailing list (feel free to CC me) and we will have a
look. Thanks!

> However, I wanted to note that many people there reported
> that cgroup_disable=memory doesn't fix anything for them, whereas that
> always makes the problem go away on my boxes. Strange.
>
> Thanks Michal and Mel, I really appreciate it!

--
Michal Hocko
SUSE Labs