Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Fri Jan 20 2017 - 01:38:05 EST


On 2017-01-19 Michal Hocko wrote:
> On Thu 19-01-17 03:48:50, Trevor Cordes wrote:
> > On 2017-01-17 Michal Hocko wrote:
> > > On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > > > > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > > > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > > > [...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 532a2a750952..46aac487b89a 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct
> > > > > > zonelist *zonelist, struct scan_control *sc) continue;
> > > > > >
> > > > > > if (sc->priority != DEF_PRIORITY &&
> > > > > > + !buffer_heads_over_limit &&
> > > > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > > > continue; /* Let
> > > > > > kswapd poll it */
> > > > >
> > > > > I think we should rather remove pgdat_reclaimable here. This
> > > > > sounds like a wrong layer to decide whether we want to reclaim
> > > > > and how much.
> > > >
> > > > I had considered that but it'd also be important to add the
> > > > other 32-bit patches you have posted to see the impact. Because
> > > > of the ratio of LRU pages to slab pages, it may not have an
> > > > impact but it'd need to be eliminated.
> > >
> > > OK, Trevor you can pull from
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > > > fixes/highmem-node-fixes branch. This contains the current mmotm
> > > > tree + the latest highmem fixes. I also do not expect this would
> > > > help much in your case but as Mel has said we should rule that out
> > > > at least.
> >
> > Hi! The git tree version above oom'd after < 24 hours (3:02am), so
> > it doesn't solve the bug. If you need an oom message dump, let me
> > know.
>
> Yes please.

The first oom from that night is attached. Note, the oom wasn't as dire
with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
detector and reboot script was able to do its thing cleanly before the
system became unusable.

I'll await further instructions and test right away. Maybe I'll try a
few tuning ideas until then. Thanks!

> > Let me know what to try next, guys, and I'll test it out.
> >
> > > > Before prototyping such a thing, I'd like to hear the outcome of
> > > > this heavy hack and then add your 32-bit patches onto the list.
> > > > If the problem is still there then I'd next look at taking slab
> > > > pages into account in pgdat_reclaimable() instead of an
> > > > outright removal that has a much wider impact. If that doesn't
> > > > work then I'll prototype a heavy-handed forced slab reclaim
> > > > when lower zones are almost all slab pages.
> >
> > I don't think I've tried the "heavy hack" patch yet? It's not in
> > the mhocko tree I just tried? Should I try the heavy hack on top
> > of mhocko git or on vanilla or what?
> >
> > I also want to mention that these PAE boxes suffer from another
> > problem/bug that I've worked around for almost a year now. For some
> > reason it keeps gnawing at me that it might be related. The disk
> > I/O goes to pot on this/these PAE boxes after a certain amount of
> > disk writes (like some unknown number of GB, around 10-ish maybe).
> > Like writes go from 500MB/s to 10MB/s!! Reboot and it's magically
> > 500MB/s again. I detail this here:
> > https://muug.ca/pipermail/roundtable/2016-June/004669.html
> > My fix was to mem=XG where X is <8 (like 4 or 6) to force the PAE
> > kernel to be more sane about highmem choices. I never filed a bug
> > because I read a ton of stuff saying Linus hates PAE, don't use over
> > 4G, blah blah. But the other fix is to:
> > set /proc/sys/vm/highmem_is_dirtyable to 1
>
> Yes this sounds like a dirty memory throttling and there were some
> changes in that area. I do not remember when exactly.

I think my PAE-slow-IO bug started way back in Fedora 22 (4.0?). It's
hard to know exactly when, as I didn't discover the bug for maybe a year
because I didn't realize IO was the problem right away. Too late to
bisect that one, I guess. It sounds like it's not related, so we can
ignore my tangent!

> > I'm not bringing this up to get attention to a new bug, I bring
> > this up because it smells like it might be related. If something
> > slowly eats away at the box's vm to the point that I/O gets
> > horribly slow, perhaps it's related to the slab and high/lomem
> > issue we have here? And if related, it may help to solve the oom
> > bug. If I'm way off base here, just ignore my tangent!
>
> From your OOM reports so far it doesn't really seem related because you
> never had large number of pages under the writeback when OOM.
>
> The situation with the PAE kernel is unfortunate but it is really hard
> to do anything about that considering that the kernel and most its
> allocations have to live in a small and scarce lowmem memory. Moreover
> the more memory you have, the more you have to allocate from that
> memory.

You're for sure right that the IO-slow bug was definitely worse the more
ram was in a system! (The mem=4G really helps alleviate this bug and is
good enough for me.)

> This is why not only Linus hates 32b systems on large memory
> systems.

Completely off-topic: it would be great if rather than pretending PAE
should work with large RAM (which seems more broken every day), the
kernel guys put out an officially stated policy of a maximum RAM you
can use, and try to have the kernel behave for <= that size, and then
people could use more RAM but clearly "at your own risk, don't bug us
about problems!". Other than a few posts about Linus hating it,
there's nothing official I can find about it in documentation, etc. It
gives the (mis)impression that it's perfectly fine to run PAE on a
zillion GB modern system. Then we later learn the hard way :-)

Attachment: oom3
Description: Binary data