Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.

From: Tetsuo Handa
Date: Tue Jan 12 2016 - 06:32:31 EST


Michal Hocko wrote:
> On Fri 08-01-16 00:38:43, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> > > if (points == chosen_points && thread_group_leader(chosen))
> > > continue;
> > >
> > > + /*
> > > + * If the current major task is already ooom killed and this
> > > + * is sysrq+f request then we rather choose somebody else
> > > + * because the current oom victim might be stuck.
> > > + */
> > > + if (is_sysrq_oom(sc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > > + continue;
> > > +
> > > chosen = p;
> > > chosen_points = points;
> > > }
> >
> > Do we want to require SysRq-f for each thread in a process?
> > If g has 1024 p, dump_tasks() will do
> >
> > pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n",
> >
> > for 1024 times? I think one SysRq-f per one process is sufficient.
>
> I am not following you here. If we kill the process the whole process
> group (aka all threads) will get killed which ever thread we happen to
> send the sigkill to.

Please distinguish "sending SIGKILL to a process" and "all threads in that
process terminate". do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true)
sends SIGKILL to a victim process, but it does not guarantee that all
threads in that process terminate even if the OOM reaper reclaimed memory.
That's when SysRq-f (and timeout based next victim selection) is needed
but currently SysRq-f forever continues selecting incorrect process.

I can observe SysRq-f is disabled
(Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160112.txt.xz .)
----------
[ 86.767482] a.out invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[ 86.769905] a.out cpuset=/ mems_allowed=0
[ 86.771393] CPU: 2 PID: 9573 Comm: a.out Not tainted 4.4.0-next-20160112+ #279
(...snipped...)
[ 86.874710] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[ 86.945286] [ 9573] 1000 9573 541717 402522 796 6 0 0 a.out
[ 86.947457] [ 9574] 1000 9574 1078 21 7 3 0 0 a.out
[ 86.949568] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 86.951538] Killed process 9574 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 86.955296] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[ 86.958035] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[ 87.128808] [ 9573] 1000 9573 541717 402522 796 6 0 0 a.out
[ 87.130926] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 87.133055] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 87.134989] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 116.979564] sysrq: SysRq : Manual OOM execution
[ 116.984119] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 116.986367] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 117.157045] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 117.159191] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 117.161302] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 117.163250] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 119.043685] sysrq: SysRq : Manual OOM execution
[ 119.046239] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 119.048453] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 119.215982] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 119.218122] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 119.220237] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 119.222129] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 120.179644] sysrq: SysRq : Manual OOM execution
[ 120.206938] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 120.209152] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 120.376821] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 120.378924] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 120.381065] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 120.382929] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 121.235296] sysrq: SysRq : Manual OOM execution
[ 121.252742] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 121.254955] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 141.024984] a.out D ffff88007c417948 0 9573 8117 0x00000080
[ 141.026830] ffff88007c417948 ffff880076cac2c0 ffff880076c442c0 ffff88007c418000
[ 141.028789] ffff88007c417980 ffff88007fc90240 00000000fffd7aa1 00000000000006bc
[ 141.030746] ffff88007c417960 ffffffff816fc1a7 ffff88007fc90240 ffff88007c417a08
[ 141.032703] Call Trace:
[ 141.033653] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.035056] [<ffffffff81700567>] schedule_timeout+0x117/0x1c0
[ 141.036629] [<ffffffff810e1310>] ? init_timer_key+0x40/0x40
[ 141.038182] [<ffffffff81700694>] schedule_timeout_uninterruptible+0x24/0x30
[ 141.039963] [<ffffffff8114944b>] __alloc_pages_nodemask+0x91b/0xd90
[ 141.041631] [<ffffffff811925e6>] alloc_pages_vma+0xb6/0x290
[ 141.043173] [<ffffffff811711d0>] handle_mm_fault+0x1180/0x1630
[ 141.044770] [<ffffffff811700a4>] ? handle_mm_fault+0x54/0x1630
[ 141.046355] [<ffffffff8105a651>] __do_page_fault+0x1a1/0x440
[ 141.047915] [<ffffffff8105a920>] do_page_fault+0x30/0x80
[ 141.049408] [<ffffffff81702307>] ? native_iret+0x7/0x7
[ 141.050876] [<ffffffff817033e8>] page_fault+0x28/0x30
[ 141.052327] [<ffffffff813a6f3d>] ? __clear_user+0x3d/0x70
[ 141.053831] [<ffffffff813ab9e8>] iov_iter_zero+0x68/0x250
[ 141.055346] [<ffffffff814866a8>] read_iter_zero+0x38/0xb0
[ 141.056854] [<ffffffff811c0994>] __vfs_read+0xc4/0xf0
[ 141.058295] [<ffffffff811c154a>] vfs_read+0x7a/0x120
[ 141.059711] [<ffffffff811c1df3>] SyS_read+0x53/0xd0
[ 141.061104] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 141.062768] a.out x ffff88007b92fca0 0 9574 9573 0x00000084
[ 141.064604] ffff88007b92fca0 ffff880076cac2c0 ffff88007a862c80 ffff88007b930000
[ 141.066555] ffff88007a863040 ffff88007a863308 ffff88007a862c80 ffff88007cc10000
[ 141.068492] ffff88007b92fcb8 ffffffff816fc1a7 ffff88007a863308 ffff88007b92fd28
[ 141.070437] Call Trace:
[ 141.071389] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.072788] [<ffffffff810733fe>] do_exit+0x6be/0xb50
[ 141.074198] [<ffffffff81073917>] do_group_exit+0x47/0xc0
[ 141.075676] [<ffffffff8107f122>] get_signal+0x222/0x7e0
[ 141.077135] [<ffffffff8100f232>] do_signal+0x32/0x6d0
[ 141.078570] [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[ 141.080176] [<ffffffff8106b967>] ? syscall_slow_exit_work+0x4b/0x10d
[ 141.081837] [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[ 141.083441] [<ffffffff8106b8ba>] ? exit_to_usermode_loop+0x2e/0x90
[ 141.085063] [<ffffffff8106b8d8>] exit_to_usermode_loop+0x4c/0x90
[ 141.086667] [<ffffffff8100355b>] syscall_return_slowpath+0xbb/0x130
[ 141.088305] [<ffffffff817018da>] int_ret_from_sys_call+0x25/0x9f
[ 141.089896] a.out D ffff88007be2fab8 0 9575 9573 0x00100084
[ 141.091734] ffff88007be2fab8 ffff880036509640 ffff8800366742c0 ffff88007be30000
[ 141.093688] 0000000000000000 7fffffffffffffff ffff88007ff72cb8 ffffffff816fca00
[ 141.095743] ffff88007be2fad0 ffffffff816fc1a7 ffff88007fc17280 ffff88007be2fb70
[ 141.097699] Call Trace:
[ 141.098649] [<ffffffff816fca00>] ? bit_wait+0x60/0x60
[ 141.100071] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.101453] [<ffffffff817005c8>] schedule_timeout+0x178/0x1c0
[ 141.103001] [<ffffffff810e81e2>] ? ktime_get+0x102/0x130
[ 141.104468] [<ffffffff810bdfd9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 141.106158] [<ffffffff810be0ad>] ? trace_hardirqs_on+0xd/0x10
[ 141.107698] [<ffffffff810e8187>] ? ktime_get+0xa7/0x130
[ 141.109138] [<ffffffff811276ea>] ? __delayacct_blkio_start+0x1a/0x30
[ 141.110782] [<ffffffff816fb641>] io_schedule_timeout+0xa1/0x110
[ 141.112350] [<ffffffff816fca16>] bit_wait_io+0x16/0x70
[ 141.113774] [<ffffffff816fc62b>] __wait_on_bit+0x5b/0x90
[ 141.115234] [<ffffffff8113f83a>] ? find_get_pages_tag+0x19a/0x2c0
[ 141.116824] [<ffffffff8113e5c6>] wait_on_page_bit+0xc6/0xf0
[ 141.118319] [<ffffffff810b5830>] ? autoremove_wake_function+0x30/0x30
[ 141.119983] [<ffffffff8113e797>] __filemap_fdatawait_range+0x107/0x190
[ 141.121643] [<ffffffff81140a8c>] ? __filemap_fdatawrite_range+0xcc/0x100
[ 141.123352] [<ffffffff8113e82f>] filemap_fdatawait_range+0xf/0x30
[ 141.124955] [<ffffffff81140bad>] filemap_write_and_wait_range+0x3d/0x60
[ 141.126655] [<ffffffff812b2614>] xfs_file_fsync+0x44/0x180
[ 141.128149] [<ffffffff811f482b>] vfs_fsync_range+0x3b/0xb0
[ 141.129646] [<ffffffff812b4242>] xfs_file_write_iter+0x102/0x140
[ 141.131260] [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[ 141.132702] [<ffffffff811c168d>] vfs_write+0x9d/0x190
[ 141.134108] [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[ 141.135593] [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[ 141.136998] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 141.138646] a.out D ffff88007af4fce8 0 9576 9573 0x00000084
[ 141.140490] ffff88007af4fce8 ffff8800366742c0 ffff880036672c80 ffff88007af50000
[ 141.142415] ffff88007d14a5b0 ffff880036672c80 0000000000000246 00000000ffffffff
[ 141.144331] ffff88007af4fd00 ffffffff816fc1a7 ffff88007d14a5a8 ffff88007af4fd10
[ 141.146308] Call Trace:
[ 141.147261] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.148651] [<ffffffff816fc4d0>] schedule_preempt_disabled+0x10/0x20
[ 141.150326] [<ffffffff816fd31b>] mutex_lock_nested+0x17b/0x3e0
[ 141.151902] [<ffffffff812b3faf>] ? xfs_file_buffered_aio_write+0x5f/0x1f0
[ 141.153647] [<ffffffff812b3faf>] xfs_file_buffered_aio_write+0x5f/0x1f0
[ 141.155397] [<ffffffff812b41c4>] xfs_file_write_iter+0x84/0x140
[ 141.156989] [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[ 141.158460] [<ffffffff811c168d>] vfs_write+0x9d/0x190
[ 141.159933] [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[ 141.161417] [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[ 141.162853] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
(...snipped...)
[ 181.154922] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 181.157145] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 181.159265] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 181.161160] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 184.227075] sysrq: SysRq : Kill All Tasks
----------
using linux-next-20160112 without "mm,oom: exclude TIF_MEMDIE processes from
candidates." patch, and reproducer shown below.
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
static char buffer[4096] = { };
const int fd = open("/tmp/file",
O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
return 0;
}

static int memory_consumer(void *unused)
{
const int fd = open("/dev/zero", O_RDONLY);
unsigned long size;
char *buf = NULL;
sleep(1);
unlink("/tmp/file");
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
read(fd, buf, size); /* Will cause OOM due to overcommit */
return 0;
}

int main(int argc, char *argv[])
{
if (fork() == 0) {
int i;
for (i = 0; i < 10; i++) {
char *cp = malloc(4096);
if (!cp || clone(file_writer, cp + 4096,
CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
break;
}
} else {
memory_consumer(NULL);
}
while (1)
pause();
}
----------

>
> > How can we guarantee that find_lock_task_mm() from oom_kill_process()
> > chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> > !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> > find_lock_task_mm() is the simplest way.
>
> find_lock_task_mm chosing TIF_MEMDIE thread shouldn't change anything
> because the whole thread group will go down anyway. If you want to
> guarantee that the sysrq+f never choses a task which has a TIF_MEMDIE
> thread then we would have to check for fatal_signal_pending as well
> AFAIU. Fiddling with find find_lock_task_mm will not help you though
> unless I am missing something.

I do want to guarantee that the SysRq-f (and timeout based next victim
selection) never chooses a process which has a TIF_MEMDIE thread.

I don't like current "oom: clear TIF_MEMDIE after oom_reaper managed to unmap
the address space" patch unless both "mm,oom: exclude TIF_MEMDIE processes from
candidates." patch and "mm,oom: Re-enable OOM killer using timers." patch are
used together. Since your patch covers only likely case, your patch cannot become
alternative to my patches which cover unlikely cases.