Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Wed Jan 25 2017 - 05:23:50 EST

On 2017-01-23 Mel Gorman wrote:
> On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> > On 2017-01-20 Mel Gorman wrote:
> > > >
> > > > Thanks for the OOM report. I was expecting it to be a particular
> > > > shape and my expectations were not matched so it took time to
> > > > consider it further. Can you try the cumulative patch below? It
> > > > combines three patches that
> > > >
> > > > 1. Allow slab shrinking even if the LRU pages are
> > > > unreclaimable in direct reclaim
> > > > 2. Shrinks slab once based on the contents of all memcgs
> > > > instead of shrinking one at a time
> > > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > >
> > > > Unfortunately it's only boot tested on x86-64 as I didn't get
> > > > the chance to setup an i386 test bed.
> > > >
> > >
> > > There was one major flaw in that patch. This version fixes it and
> > > addresses other minor issues. It may still be too aggressive
> > > shrinking slab but worth trying out. Thanks.
> >
> > I ran with your patch below and it oom'd on the first night. It was
> > weird, it didn't hang the system, and my rebooter script started a
> > reboot but the system never got more than half down before it just
> > sat there in a weird state where a local console user could still
> > login but not much was working. So the patches don't seem to solve
> > the problem.
> >
> > For the above compile I applied your patches to 4.10.0-rc4+, I hope
> > that's ok.
> >
> It would be strongly preferred to run them on top of Michal's other
> fixes. The main reason it's preferred is because this OOM differs from
> earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
> That meant that the slab shrinking could not happen from direct
> reclaim so the balancing from my patches would not occur. As
> Michal's other patches affect how kswapd behaves, it's important.

OK, I patched & compiled mhocko's git tree from the other day 4.9.0+.
(To confirm, weird, but mhocko's git tree I'm using from a couple of
weeks ago shows the newest commit (git log) is
69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know if
I'm doing something wrong, see below.)

Anyhow, it oom'd as usual at ~3am, system froze after 20 ooms hit in 7
secs. So no help there. Attached is the oom log from the first oom.

On 2017-01-24 Michal Hocko wrote:
> On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
> [...]
> > Also, completely separate from your patch I ran mhocko's 4.9 tree
> > with mem=2G to see if lower ram amount would help, but it didn't.
> > Even with 2G the system oom'd and hung the same as usual. So far the
> > only thing that helps at all was the cgroup_disable=memory option,
> > which makes the problem disappear completely for me.
> OK, can we reduce the problem space slightly more and could you boot
> with kmem accounting disabled? cgroup.memory=nokmem,nosocket

I will try that right now, I'll use the mhocko git tree without Mel's
emailed patch, and I'll refresh the git tree from origin first (let me
know if that's a bad move). As usual, I'll report back within 24-48 hours.
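
For reference, a minimal sketch of how one might confirm after reboot that
the requested option actually reached the kernel. The helper name
has_cmdline_opt is hypothetical (not from this thread); it only does a
whole-word match against the text of /proc/cmdline and says nothing about
whether the kernel honoured the option.

```shell
# Hypothetical helper: succeed if option $1 appears as a whole word
# in the kernel command line string $2.
has_cmdline_opt() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Read the running kernel's command line (empty if /proc is unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || echo "")

if has_cmdline_opt "cgroup.memory=nokmem,nosocket" "$cmdline"; then
    echo "cgroup.memory=nokmem,nosocket is on the command line"
else
    echo "cgroup.memory=nokmem,nosocket is NOT on the command line"
fi
```

The same check works for cgroup_disable=memory by changing the first
argument.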

Actually, on my tests with mhocko git tree, I'm a bit confused and want
to make sure I'm compiling the right thing. His tree doesn't seem to
have recent commits? I did "git fetch origin" and "git reset --hard
origin/master" to refresh the tree just now and the latest commit is
still the one shown above "Linux 4.9"? Is Michal making changes but
not committing? How do I ensure I'm compiling the version you guys want
me to test? ("git log mm/vmscan.c" shows newest commit is Dec 2??) Am
I supposed to be testing a specific branch?
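
One way to answer the branch question from the clone itself is sketched
below. It assumes the remote is named origin, as in the commands above;
the helper name list_remote_branches is hypothetical. It only lists the
remote-tracking branches with their newest commit dates, so it shows where
recent work lives but not which branch is the intended one to test.

```shell
# Hypothetical helper: list a remote's branches, newest commit first.
# $1 is the remote name (defaults to origin).
list_remote_branches() {
    remote="${1:-origin}"
    git for-each-ref --sort=-committerdate \
        --format='%(committerdate:short) %(refname:short) %(subject)' \
        "refs/remotes/$remote"
}

# Typical use inside the cloned tree:
#   git fetch origin
#   list_remote_branches origin
#   git log -1 --format='%H %s' origin/master   # what master points at
```

If origin/master is still "Linux 4.9" while another branch shows January
commits, that would explain the stale "git log mm/vmscan.c" output.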

If I've been testing the wrong branch, this *only* affects my mhocko
tree tests (not the vanilla or fedora-patched tests). Thankfully I
think I've only done 1 or 2 mhocko tree tests, and I can easily redo
them. If this turns out to be the case, I'm so sorry for the
confusion, the non-vanilla git tree thing is all new to me.

In any event, I'm still trying the above, and will adjust if necessary
if it's confirmed I'm doing something wrong with the mhocko git tree.

Attachment: oom5
Description: Binary data