Re: Can't we use timeout based OOM warning/killing?

From: Tetsuo Handa
Date: Mon Oct 12 2015 - 02:44:22 EST


Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
const int fd = open("/proc/self/exe", O_RDONLY);
while (1) {
void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
munmap(ptr, 4096);
}
return 0;
}

int main(int argc, char *argv[])
{
char buffer[128] = { };
const pid_t pid = fork();
if (pid == 0) { /* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
int i;
for (i = 0; i < 9; i++)
clone(writer, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
writer(NULL);
}
snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
if (fork() == 0) { /* down_read(&mm->mmap_sem) requester. */
const int fd = open(buffer, O_RDONLY);
while (pread(fd, buffer, sizeof(buffer), 0) > 0);
_exit(0);
} else { /* A dummy process for invoking the OOM killer. */
char *buf = NULL;
unsigned long size = 0;
const int fd = open("/dev/zero", O_RDONLY);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
read(fd, buf, size); /* Will cause OOM due to overcommit */
return 0;
}
}
---------- Reproducer end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .

Uptime between 101 and 300 is a silent hang up (i.e. no OOM killer messages,
no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f
at uptime = 289. I don't know the reason of this silent hang up, but the
memory unzapping kernel thread will not help because there is no OOM victim.

----------
[ 101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[ 292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[ 302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 is a mmap_sem livelock after the OOM killer was
invoked.

----------
[ 380.039897] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 380.042500] [ 467] 0 467 14047 1815 28 3 0 0 systemd-journal
[ 380.045055] [ 482] 0 482 10413 259 23 3 0 -1000 systemd-udevd
[ 380.047637] [ 504] 0 504 12795 119 25 3 0 -1000 auditd
[ 380.050127] [ 1244] 0 1244 82428 4257 81 3 0 0 firewalld
[ 380.052536] [ 1247] 70 1247 6988 61 21 3 0 0 avahi-daemon
[ 380.055028] [ 1250] 0 1250 54104 1372 42 4 0 0 rsyslogd
[ 380.057505] [ 1251] 0 1251 137547 2620 91 3 0 0 tuned
[ 380.059996] [ 1255] 0 1255 4823 77 15 3 0 0 irqbalance
[ 380.062552] [ 1256] 0 1256 1095 37 8 3 0 0 rngd
[ 380.065020] [ 1259] 0 1259 53626 441 60 3 0 0 abrtd
[ 380.067383] [ 1260] 0 1260 53001 341 58 5 0 0 abrt-watch-log
[ 380.069965] [ 1265] 0 1265 8673 83 21 3 0 0 systemd-logind
[ 380.072554] [ 1266] 81 1266 6663 117 18 3 0 -900 dbus-daemon
[ 380.075122] [ 1272] 0 1272 31577 154 21 3 0 0 crond
[ 380.077544] [ 1314] 70 1314 6988 57 19 3 0 0 avahi-daemon
[ 380.080013] [ 1427] 0 1427 46741 225 44 3 0 0 vmtoolsd
[ 380.082478] [ 1969] 0 1969 25942 3100 48 3 0 0 dhclient
[ 380.084969] [ 1990] 999 1990 128626 1929 50 4 0 0 polkitd
[ 380.087516] [ 2073] 0 2073 20629 214 45 3 0 -1000 sshd
[ 380.090065] [ 2201] 0 2201 7320 68 21 3 0 0 xinetd
[ 380.092465] [ 3215] 0 3215 22773 257 44 3 0 0 master
[ 380.094879] [ 3217] 89 3217 22816 249 45 3 0 0 qmgr
[ 380.097304] [ 3249] 0 3249 75245 315 97 3 0 0 nmbd
[ 380.099666] [ 3259] 0 3259 92963 486 131 5 0 0 smbd
[ 380.101956] [ 3282] 0 3282 27503 30 12 3 0 0 agetty
[ 380.104277] [ 3283] 0 3283 21788 154 49 3 0 0 login
[ 380.106574] [ 3286] 0 3286 92963 486 126 5 0 0 smbd
[ 380.108835] [ 3296] 1000 3296 28864 117 13 3 0 0 bash
[ 380.111073] [ 3374] 89 3374 22799 249 46 3 0 0 pickup
[ 380.113298] [ 3378] 89 3378 22836 252 45 3 0 0 cleanup
[ 380.115555] [ 3385] 89 3385 22800 248 44 3 0 0 trivial-rewrite
[ 380.117811] [ 3392] 0 3392 22825 265 48 3 0 0 local
[ 380.119995] [ 3393] 0 3393 30828 59 17 3 0 0 anacron
[ 380.122183] [ 3417] 1000 3417 541715 397587 787 6 0 0 a.out
[ 380.124315] [ 3418] 1000 3418 1081 24 8 3 0 0 a.out
[ 380.126410] [ 3419] 1000 3419 1042 21 7 3 0 0 a.out
[ 380.128535] Out of memory: Kill process 3417 (a.out) score 890 or sacrifice child
[ 380.130392] Killed process 3418 (a.out) total-vm:4324kB, anon-rss:96kB, file-rss:0kB
[ 392.704028] MemAlloc-Info: 7 stalling task, 10 dying task, 1 victim task.
(...snipped...)
[ 601.129977] a.out R running task 0 3417 3296 0x00000080
[ 601.131899] ffff8800774dba10 ffffffff8112b174 0000000000000100 0000000000000000
[ 601.134026] 0000000000000000 0000000000000000 00000000a23cb49d 0000000000000000
[ 601.136076] ffff880077603200 00000000024280ca 0000000000000000 ffff880077603200
[ 601.138090] Call Trace:
[ 601.139145] [<ffffffff8112b174>] ? try_to_free_pages+0x94/0xc0
[ 601.140831] [<ffffffff8111a8c4>] ? out_of_memory+0x2f4/0x460
[ 601.142489] [<ffffffff8111fa63>] ? __alloc_pages_nodemask+0x613/0xc30
[ 601.144328] [<ffffffff81161c40>] ? alloc_pages_vma+0xb0/0x200
[ 601.145994] [<ffffffff81143056>] ? handle_mm_fault+0xfa6/0x1370
[ 601.147677] [<ffffffff8162f557>] ? native_iret+0x7/0x7
[ 601.149258] [<ffffffff81058217>] ? __do_page_fault+0x177/0x400
[ 601.150966] [<ffffffff810584d0>] ? do_page_fault+0x30/0x80
[ 601.152625] [<ffffffff81630518>] ? page_fault+0x28/0x30
[ 601.154159] [<ffffffff813230c0>] ? __clear_user+0x20/0x50
[ 601.155723] [<ffffffff81327a68>] ? iov_iter_zero+0x68/0x250
[ 601.157329] [<ffffffff813fc6c8>] ? read_iter_zero+0x38/0xa0
[ 601.158923] [<ffffffff81187f04>] ? __vfs_read+0xc4/0xf0
[ 601.160453] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[ 601.161961] [<ffffffff811893a0>] ? SyS_read+0x50/0xc0
[ 601.163513] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.165254] a.out D ffff8800777b7e08 0 3418 3417 0x00100084
[ 601.167118] ffff8800777b7e08 ffff880077606400 ffff8800777b8000 ffff880036032e00
[ 601.169137] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff8800777b7e20
[ 601.171159] ffffffff8162a570 ffff880077606400 ffff8800777b7ea8 ffffffff8162d8eb
[ 601.173183] Call Trace:
[ 601.174193] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.175661] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.177388] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.179194] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.180971] [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[ 601.182509] [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[ 601.183992] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[ 601.185485] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.187224] a.out D ffff88007c60fdf0 0 3420 3417 0x00000084
[ 601.189130] ffff88007c60fdf0 ffff880078e15780 ffff88007c610000 ffff880036032de8
[ 601.191158] ffff880036032e00 ffff88007c60ff58 ffff880078e15780 ffff88007c60fe08
[ 601.193180] ffffffff8162a570 ffff880078e15780 ffff88007c60fe68 ffffffff8162d698
[ 601.195217] Call Trace:
[ 601.196226] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.197683] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.199407] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.201192] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.202711] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.204328] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.205874] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.207376] a.out D ffff88007c24fdf0 0 3421 3417 0x00000084
[ 601.209286] ffff88007c24fdf0 ffff880078e13200 ffff88007c250000 ffff880036032de8
[ 601.211316] ffff880036032e00 ffff88007c24ff58 ffff880078e13200 ffff88007c24fe08
[ 601.213335] ffffffff8162a570 ffff880078e13200 ffff88007c24fe68 ffffffff8162d698
[ 601.215356] Call Trace:
[ 601.216377] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.217831] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.219529] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.221296] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.222802] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.224403] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.225958] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.227453] a.out D ffff88007823bdf0 0 3422 3417 0x00000084
[ 601.229348] ffff88007823bdf0 ffff880078e10000 ffff88007823c000 ffff880036032de8
[ 601.231395] ffff880036032e00 ffff88007823bf58 ffff880078e10000 ffff88007823be08
[ 601.233427] ffffffff8162a570 ffff880078e10000 ffff88007823be68 ffffffff8162d698
[ 601.235472] Call Trace:
[ 601.236504] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.237989] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.239720] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.241583] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.243144] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.244777] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.246307] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.247823] a.out D ffff88007c483df0 0 3423 3417 0x00000084
[ 601.249719] ffff88007c483df0 ffff880078e13e80 ffff88007c484000 ffff880036032de8
[ 601.251765] ffff880036032e00 ffff88007c483f58 ffff880078e13e80 ffff88007c483e08
[ 601.253808] ffffffff8162a570 ffff880078e13e80 ffff88007c483e68 ffffffff8162d698
[ 601.255831] Call Trace:
[ 601.256850] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.258286] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.260005] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.261803] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.263329] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.264936] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.266504] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.268019] a.out D ffff880035893e08 0 3424 3417 0x00000084
[ 601.269940] ffff880035893e08 ffff880078e17080 ffff880035894000 ffff880036032e00
[ 601.271945] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff880035893e20
[ 601.273954] ffffffff8162a570 ffff880078e17080 ffff880035893ea8 ffffffff8162d8eb
[ 601.276000] Call Trace:
[ 601.277007] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.278497] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.280240] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.282058] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.283872] [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[ 601.285403] [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[ 601.286924] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[ 601.288435] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.290184] a.out D ffff8800353b7df0 0 3425 3417 0x00000084
[ 601.292108] ffff8800353b7df0 ffff880078e10c80 ffff8800353b8000 ffff880036032de8
[ 601.294165] ffff880036032e00 ffff8800353b7f58 ffff880078e10c80 ffff8800353b7e08
[ 601.296206] ffffffff8162a570 ffff880078e10c80 ffff8800353b7e68 ffffffff8162d698
[ 601.298267] Call Trace:
[ 601.299300] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.300755] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.302437] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.304221] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.305764] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.307389] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.308968] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.310488] a.out D ffff88007cf87df0 0 3426 3417 0x00000084
[ 601.312380] ffff88007cf87df0 ffff880078e16400 ffff88007cf88000 ffff880036032de8
[ 601.314414] ffff880036032e00 ffff88007cf87f58 ffff880078e16400 ffff88007cf87e08
[ 601.316443] ffffffff8162a570 ffff880078e16400 ffff88007cf87e68 ffffffff8162d698
[ 601.318490] Call Trace:
[ 601.319536] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.321036] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.322763] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.324504] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.326071] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.327715] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.329287] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.330761] a.out D ffff8800792dfdf0 0 3427 3417 0x00000084
[ 601.332705] ffff8800792dfdf0 ffff880078e12580 ffff8800792e0000 ffff880036032de8
[ 601.334699] ffff880036032e00 ffff8800792dff58 ffff880078e12580 ffff8800792dfe08
[ 601.336750] ffffffff8162a570 ffff880078e12580 ffff8800792dfe68 ffffffff8162d698
[ 601.338794] Call Trace:
[ 601.339781] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.341280] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.343009] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.344813] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.346361] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.347990] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.349521] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.351044] a.out D ffff88007743faa8 0 3428 3417 0x00000084
[ 601.352942] ffff88007743faa8 ffff88007bda6400 ffff880077440000 ffff88007743fae0
[ 601.354990] ffff88007fccdfc0 00000001000484e5 0000000000000000 ffff88007743fac0
[ 601.357024] ffffffff8162a570 ffff88007fccdfc0 ffff88007743fb40 ffffffff8162dbed
[ 601.359075] Call Trace:
[ 601.360096] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.361540] [<ffffffff8162dbed>] schedule_timeout+0x11d/0x1c0
[ 601.363190] [<ffffffff810c7e00>] ? cascade+0x90/0x90
[ 601.364697] [<ffffffff8162dce9>] schedule_timeout_uninterruptible+0x19/0x20
[ 601.366574] [<ffffffff8111fc9d>] __alloc_pages_nodemask+0x84d/0xc30
[ 601.368332] [<ffffffff811609a7>] alloc_pages_current+0x87/0x110
[ 601.370002] [<ffffffff811166cf>] __page_cache_alloc+0xaf/0xc0
[ 601.371606] [<ffffffff81119225>] filemap_fault+0x1e5/0x420
[ 601.373203] [<ffffffff81244f39>] xfs_filemap_fault+0x39/0x60
[ 601.374798] [<ffffffff8113d5e7>] __do_fault+0x47/0xd0
[ 601.376315] [<ffffffff81142ec5>] handle_mm_fault+0xe15/0x1370
[ 601.377938] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.379707] [<ffffffff81058217>] __do_page_fault+0x177/0x400
[ 601.381320] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.382831] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.384337] a.out R running task 0 3419 3417 0x00000080
[ 601.386257] 00000000f80745e8 ffff880034ab4400 ffff8800776d3f18 ffff8800776d3f18
[ 601.388287] 0000000000000080 0000000000000000 ffff8800776d3ec8 ffffffff81187e72
[ 601.390341] ffff880034ab4400 ffff880034ab4410 0000000000020000 0000000000000000
[ 601.392366] Call Trace:
[ 601.393388] [<ffffffff81187e72>] ? __vfs_read+0x32/0xf0
[ 601.394952] [<ffffffff81290aa9>] ? security_file_permission+0xa9/0xc0
[ 601.396745] [<ffffffff8118858d>] ? rw_verify_area+0x4d/0xd0
[ 601.398359] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[ 601.399897] [<ffffffff81189560>] ? SyS_pread64+0x90/0xb0
[ 601.401429] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
----------

I think that I noticed three problems from this reproducer.

(1) While the likeliness of hitting mmap_sem livelock would depend on how
frequently down_read(&mm->mmap_sem) tasks and down_write(&mm->mmap_sem)
tasks contend on the OOM victim's mm, we can hit mmap_sem livelock with
even only one down_read(&mm->mmap_sem) task. On systems where processes
are monitored using /proc/pid/ interface, we can by chance hit this
mmap_sem livelock.

(2) The OOM killer tries to kill child process of the memory hog. But the
child process is not always consuming a lot of memory. The memory
unzapping kernel thread might not be able to reclaim enough memory
unless we choose subsequent OOM victims when the first OOM victim task
got mmap_sem livelock.

(3) I don't know the reason but I can observe that (when there are many
tasks which got SIGKILL by the OOM killer) many of dying tasks participate
in a memory allocation competition via page_fault() which cannot make
forward progress because dying tasks without TIF_MEMDIE are not allowed
to access the memory reserves.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/