Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Wed Feb 01 2017 - 04:29:57 EST


On 2017-01-30 Michal Hocko wrote:
> On Sun 29-01-17 16:50:03, Trevor Cordes wrote:
> > On 2017-01-25 Michal Hocko wrote:
> > > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > > > OK, I patched & compiled mhocko's git tree from the other day
> > > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using
> > > > from a couple of weeks ago shows the newest commit (git log) is
> > > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me
> > > > know if I'm doing something wrong, see below.)
> > >
> > > My fault. I should have noted that you should use since-4.9
> > > branch.
> >
> > OK, I have good news. I compiled your mhocko git tree (properly
> > this time!) using since-4.9 branch (last commit
> > ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box
> > survived 3 3am's, over 60 hours, and I made sure all the usual oom
> > culprits ran, and I ran extras (finds on the whole tree, extra
> > rdiff-backups) to try to tax it. Based on my previous criteria I
> > would say your since-4.9 as of the above commit solves my bug, at
> > least over a 3 day test span (which it never survives when the bug
> > is present)!
> >
> > I tested WITHOUT any cgroup/mem boot options. I do still have my
> > mem=6G limiter on, though (I've never tested with it off, until I
> > solve the bug with it on, since I've had it on for many months for
> > other reasons).
>
> Good news indeed.

Even better, another guy on the rhbz reported the mhocko git tree
since-4.9 solves the bug for him too! And it ran another night (4+
total) without problems on my box. Whatever is in since-4.9 fixes it,
as I reported before.

But...

> Testing with Vanilla rc6 released just yesterday would be a good fit.
> There are some more fixes sitting on mmotm on top and maybe we want
> some of them in final 4.10. Anyway all those pending changes should
> be merged in the next merge window - aka 4.11

After 30 hours of running vanilla 4.10.0-rc6, the box started to go
bonkers at 3am, so vanilla does not fix the bug :-( But the bug hit
differently this time: the box just bogged down like crazy and gave
really weird top output. Starting nano would take 10s, then it would
run at full speed, then saving a file would take 5s. Starting any prog
not in cache took just as long.

However, no oom hit. I waited about 15 minutes and things seemed to
bog more, so I rebooted into since-4.9. Maybe if I had kept waiting
the box would have oom'd, but I didn't want to take the chance (it's
remote, and I can't reset it).

I did capture a lot of the weird top, meminfo and slabinfo data before
rebooting. I've attached the output to this email. Messages show a
lot of "page allocation stalls" during the bogged-down time.
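
For anyone who wants to pull the lowmem numbers out of a meminfo
capture, here is a minimal sketch in Python. It assumes the usual
"Name: value kB" layout of /proc/meminfo and the LowTotal/LowFree
fields that highmem (e.g. PAE) kernels expose. The function names are
made up for illustration and are not from any existing tool.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Values are in kB for lines that carry a "kB" suffix; counters such
    as HugePages_Total have no unit and are stored as plain integers.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\S+):\s+(\d+)(?:\s+kB)?\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def lowmem_free_fraction(fields):
    """Return LowFree/LowTotal, or None if those fields are absent
    (assumption: they only appear on kernels built with highmem)."""
    total = fields.get("LowTotal")
    free = fields.get("LowFree")
    if not total or free is None:
        return None
    return free / total
```

Feeding it the contents of /proc/meminfo (or a saved capture) and
logging lowmem_free_fraction() next to the Slab value every minute or
so might make it easier to see what lowmem is doing around 3am.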

So my hunch at this moment is 4.10.0-rc6 might help alleviate the
problem somewhat, but it's other things you have in since-4.9 that
solve it completely.

Let me know if you need any more testing or some bisecting or
something. I'll keep on running since-4.9 in the meantime. Thanks!

Attachment: 4.10.rc6-bogged
Description: Binary data