memory bug ever since 3.12, oom-killer invoked, computer freezes

From: Trevor Cordes
Date: Tue Jul 08 2014 - 18:58:27 EST


Excuse a novice on his first post to this list. I have tried to obtain
help elsewhere with no success.

I have been dealing with a bad kernel bug since 3.12 came out. It is
present in 3.12, 3.13 and 3.14 up to 3.14.8 (Fedora 19 kernel).

What happens is that at around the same time every day, on the buggy
kernels, I get dozens of oom-killer messages over about 3-5 minutes.
The system slows to a crawl instantly and usually freezes (numlock no
longer works, etc.) within a few minutes.

Using 3.11, the system runs fine, there is no bug.

I think I have isolated the trigger of the problem to a simple
backup-helper script I run nightly at the same time. I have come to
this conclusion based on the fact that I can run 3.14 for many days with
no problems if I disable my script from running. As soon as I enable
the script, the bug will hit the subsequent morning at the same time as
usual. Again, in 3.11 there is no bug even if my script is running.

I have made a RH bugzilla bug for this that contains even more detail:
https://bugzilla.redhat.com/show_bug.cgi?id=1075185

My script looks like this (simplified):
#!/bin/perl
$dirs="/ /mnt/peecee/DATA";
$Ddest="/data/Bak/FindList";
# nice/ionice keep the scan at the lowest CPU and I/O priority
system "/bin/nice -n19 /usr/bin/ionice -c2 -n7 -t find $dirs -xdev -ls "
    . "2>/dev/null > $Ddest/find-list";

Notes: /mnt/peecee is a cifs share (old XP box). $Ddest is an NFS
mount on my file server.

This script runs in about 1 min when nothing is cached, about 10s when
everything is cached.

I can run this script 200 times over and over again manually for
testing (not via the usual cron) and it does NOT trigger the bug. It
is only when I enable this script via cron that the bug occurs.

I have captured key /proc files at moments in time before/during the
bug occurring, which may help figure out the problem. I have attached
those files to the bugzilla linked above. I can post them here if
required. I can obtain more/finer results if required. I can
reproduce this bug "sort of on demand" by enabling my script to run the
following morning.
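
For anyone who wants to capture similar data on their own machine, a
minimal sketch of a snapshot helper follows. It is not the script used
for the bugzilla attachments: the function name snapshot_proc, the list
of /proc files and the destination directory are all assumptions made
for illustration.

```python
#!/usr/bin/env python3
"""Sketch: copy a few /proc files into a timestamped directory."""
import os
import time

# Assumed file list; the bugzilla attachments may cover different files.
PROC_FILES = ("meminfo", "slabinfo", "buddyinfo", "vmstat", "zoneinfo")


def snapshot_proc(dest_root, names=PROC_FILES, proc_root="/proc"):
    """Copy each readable file in names from proc_root into a new
    timestamped directory under dest_root and return that directory."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(dest_root, stamp)
    os.makedirs(dest, exist_ok=True)
    for name in names:
        src = os.path.join(proc_root, name)
        try:
            # /proc files report a size of 0, so read the content directly.
            with open(src, "rb") as f:
                data = f.read()
        except OSError:
            continue  # skip files that are missing or not readable
        with open(os.path.join(dest, name), "wb") as f:
            f.write(data)
    return dest
```

Calling snapshot_proc("/var/tmp/proc-snapshots") every few seconds
around the time the cron job fires (the path is hypothetical) would
leave one directory per snapshot, which can then be compared before and
during the oom-killer activity.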

Known buggy kernels:
3.14.8-100.fc19
3.14.4-100.fc19
3.13.9-100.fc19
3.13.5-103.fc19
3.12.9-201.fc19

Known good kernel:
3.11.10-200.fc19

My kernels are all 32-bit, PAE.

My / is md RAID1. The disks are 15k UW-SCSI enterprise drives. The
controller is Adaptec AIC-7892A U160/m, a 29160 card I believe. I am
usually tainted by the Nvidia binary video driver, but can untaint for
purposes of testing.

I wanted to bisect to help figure this out but cannot using Fedora
tools due to a bug in the 32-bit Python libraries. I don't know how to
bisect the vanilla kernel whilst still incorporating all Fedora tweaks
without using Fedora tools.

I did much googling and discovered this thread which sounds very much
related to my problem, though not an exact duplicate:
http://marc.info/?l=linux-mm&m=139267140606805&w=2