mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Trevor Cordes
Date: Wed Jan 11 2017 - 05:33:04 EST


Hi! I have bisected a nightly oom-killer flood and crash/hang on one of
the boxes I admin. It doesn't crash on Fedora 23/24 4.7.10 kernel but
does on any 4.8 Fedora kernel. I did a vanilla bisect and the bug is
here:

commit b2e18757f2c9d1cdd746a882e9878852fdec9501
Author: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Date: Thu Jul 28 15:45:37 2016 -0700

mm, vmscan: begin reclaiming pages on a per-node basis

I bisected between:
# bad: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7

I have not tried anything newer than the Fedora 4.8.13 kernel, but if
someone thinks this bug is already fixed in HEAD I could try that next. It
took 3 weeks to bisect because the crash only seems to happen in the
middle of the night, and on most nights but not every night.

It does not occur on most of my other boxes, just this one. The box is
somewhat unusual in that it's running 32-bit PAE on a 64-bit capable CPU,
and I have the memory tuned down to mem=6G in the kernel command line (I
think it actually has 16GB installed). I tuned the RAM down because around
8GB the PAE kernel has massive IO speed issues.

It is a relatively new Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz on an
Intel S1200BTL board. I will eventually change it to 64-bit Fedora which
I'm sure will solve this bug, but since there's no easy upgrade path,
that's on the backburner on this production box.

I'm sure this will be another "PAE sucks, don't use it" issue, but like I
said, I'm currently stuck with it, and in theory the kernel shouldn't
crash like this (I'm guessing/hoping).

I think I have pinned the trigger down to big dir scans (like
"find /bigdir-foo") running at around 3am. The source is either a remote
box doing indexing via smbd, or rsync or rdiff-backup also doing big dir
scans, or both. But when I do "find /" manually I can't trigger the bug.
Very weird.

The commit notes make it sound like the author thought perhaps there could
be a problem in some scenarios? I guess I found the scenario.

The only discussion I found on the net regarding this commit is
https://lkml.org/lkml/2016/8/29/154
It may be somewhat relevant, but it's a bit over my head.

I'm available for testing, etc, and can usually rule out a bad kernel
within 24 hours by just waiting for 3am to roll around. I also have
copious logs I can provide and screenshots of the crashes.

The box is extremely lightly loaded, and RAM use is almost always under
1GB, and swap is 0-20k used most of the time with GBs free. Everything
looks great until all of a sudden oom-killer starts running and goes
through 10-260 iterations before the system just dies. I wrote a script
to watch for oom-killer and issue "reboot" immediately, but 80% of the
time the box will hang before the reboot actually manages to shut down.
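
For anyone wanting to reproduce the watchdog idea, a minimal sketch of such
a watcher might look like the following. This is not the actual script from
this box; it assumes util-linux "dmesg --follow" is available, that it runs
as root, and that the kernel log line contains the text "invoked
oom-killer". The names is_oom_line, watch and main are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical oom-killer watchdog: follow the kernel log, reboot on OOM."""
import subprocess

# Text the kernel logs when the OOM killer is invoked, e.g.
# "smbd invoked oom-killer: gfp_mask=..., order=0, oom_score_adj=0"
OOM_MARKER = "invoked oom-killer"


def is_oom_line(line: str) -> bool:
    """Return True if a kernel log line reports an oom-killer invocation."""
    return OOM_MARKER in line


def watch(lines, on_oom) -> bool:
    """Scan an iterable of log lines; call on_oom() once at the first match.

    Returns True if an OOM line was seen, False if the input ended first.
    """
    for line in lines:
        if is_oom_line(line):
            on_oom()
            return True
    return False


def main() -> bool:
    # Assumption: run as root. "dmesg --follow" first prints the existing
    # ring buffer, so an OOM already logged since boot triggers immediately.
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    # Assumption: a plain "reboot" is the desired reaction.
    return watch(proc.stdout, lambda: subprocess.call(["reboot"]))
```

Calling main() starts the watcher. Note that it only issues the reboot; as
described above, the box may still hang before the shutdown completes.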

Any information/help I can provide, please just holler. Thanks!