Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Wed Jan 18 2017 - 02:29:52 EST


On 2017-01-17 Michal Hocko wrote:
> On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 532a2a750952..46aac487b89a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist
> > > > *zonelist, struct scan_control *sc) continue;
> > > >
> > > > if (sc->priority != DEF_PRIORITY &&
> > > > + !buffer_heads_over_limit &&
> > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > continue; /* Let kswapd
> > > > poll it */
> > >
> > > I think we should rather remove pgdat_reclaimable here. This
> > > sounds like a wrong layer to decide whether we want to reclaim
> > > and how much.
> >
> > I had considered that but it'd also be important to add the other
> > 32-bit patches you have posted to see the impact. Because of the
> > ratio of LRU pages to slab pages, it may not have an impact but
> > it'd need to be eliminated.
>
> OK, Trevor you can pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> fixes/highmem-node-fixes branch. This contains the current mmotm tree
> + the latest highmem fixes. I also do not expect this would help much
> in your case but as Mel've said we should rule that out at least.

OK, ignore my last question re: what to do next. I am building
this mhocko git tree now per your above instructions and will reboot
into it in a few hours with*out* the cgroup_disable=memory option.
Might take ~50 hours for a result.

I should note that the workload on the box with the bug is mostly as a
file server and iptables firewall/router. It routes around 8GB(ytes) a
day, and periodic file server loads. That's about it. Everything else
that is running is not doing much, and not using much RAM; except
maybe clamav, by far the biggest RAM.

I don't see this bug on other nearly identical boxes, including:
F24 4.8.15 32-bit (no PAE) 1GB ram P4
F24 4.8.15 32-bit (no PAE) 2GB ram Core2 Quad

However, just noticed for the first time today that one other box is
also seeing this bug (gets an oom message), though with much less
frequency: twice in 2 months since upgrading to 4.8. However, it
recovers from the oom without a reboot and hasn't hanged (yet). That
could be because this box does not do as much file serving or I/O as
the one I've been building/testing on. Also, this box is a much older
Pentium-D with 4GB (PAE on). If it would be helpful to see its oom
log, let me know. (Scanning all my boxes now, I also found 1 single oom
on yet another 1 computer with the same story; but this is a Xeon
E3-1220 32-bit with PAE, 4GB.)

So far the commonality seems to be >2GB RAM and PAE on. Might be
interesting to boot my build/test box with mem=2G and isolate it to
small RAM vs PAE. "mem=2G" would make a great, easy, immediate
workaround for this problem for me (as cgroup_disable=memory also seems
to do, so far). Thanks!