Re: [PATCH -mm] memcg: do not trigger OOM fromadd_to_page_cache_locked

From: Johannes Weiner
Date: Mon Nov 26 2012 - 13:24:21 EST


On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > > [CCing also Johannes - the thread started here:
> > > https://lkml.org/lkml/2012/11/21/497]
> > >
> > > On Mon 26-11-12 01:38:55, azurIt wrote:
> > > > >This is hackish but it should help you in this case. Kamezawa, what do
> > > > >you think about that? Should we generalize this and prepare something
> > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > > > >automatically and use the function whenever we are in a locked context?
> > > > >To be honest I do not like this very much but nothing more sensible
> > > > >(without touching non-memcg paths) comes to my mind.
> > > >
> > > >
> > > > I installed kernel with this patch, will report back if problem occurs
> > > > again OR in few weeks if everything will be ok. Thank you!
> > >
> > > Now that I am looking at the patch closer it will not work because it
> > > depends on other patch which is not merged yet and even that one would
> > > help on its own because __GFP_NORETRY doesn't break the charge loop.
> > > Sorry I have missed that...
> > >
> > > The patch bellow should help though. (it is based on top of the current
> > > -mm tree but I will send a backport to 3.2 in the reply as well)
> > > ---
> > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@xxxxxxx>
> > > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> > >
> > > memcg oom killer might deadlock if the process which falls down to
> > > mem_cgroup_handle_oom holds a lock which prevents other task to
> > > terminate because it is blocked on the very same lock.
> > > This can happen when a write system call needs to allocate a page but
> > > the allocation hits the memcg hard limit and there is nothing to reclaim
> > > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > > have been reclaimed already) and the process selected by memcg OOM
> > > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > >
> > > Process A
> > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> > > [<ffffffff81121c90>] do_last+0x250/0xa30
> > > [<ffffffff81122547>] path_openat+0xd7/0x440
> > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> > > [<ffffffff8110f950>] sys_open+0x20/0x30
> > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > > Process B
> > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> > > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> > > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> > > [<ffffffff81112381>] sys_write+0x51/0x90
> > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > So process B manages to lock the hierarchy, calls
> > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> > for task A to die. All while it holds the i_mutex, preventing task A
> > from dying, right?
>
> Right.
>
> > I think global oom already handles this in a much better way: invoke
> > the OOM killer, sleep for a second, then return to userspace to
> > relinquish all kernel resources and locks. The only reason why we
> > can't simply change from an endless retry loop is because we don't
> > want to return VM_FAULT_OOM and invoke the global OOM killer.
>
> Exactly.
>
> > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > just restart the pagefault. Return -ENOMEM to the buffered IO syscall
> > respectively. This way, the memcg OOM killer is invoked as it should
> > but nobody gets stuck anywhere livelocking with the exiting task.
>
> Hmm, we would still have a problem with oom disabled (aka user space OOM
> killer), right? All processes but those in mem_cgroup_handle_oom are
> risky to be killed.

Could we still let everybody get stuck in there when the OOM killer is
disabled and let userspace take care of it?

> Other POV might be, why we should trigger an OOM killer from those paths
> in the first place. Write or read (or even readahead) are all calls that
> should rather fail than cause an OOM killer in my opinion.

Readahead is arguable, but we kill globally for read() and write() and
I think we should do the same for memcg.

The OOM killer is there to resolve a problem that comes from
overcommitting the machine but the overuse does not have to be from
the application that pushes the machine over the edge, that's why we
don't just kill the allocating task but actually go look for the best
candidate. If you have one memory hog that overuses the resources,
attempted memory consumption in a different program should invoke the
OOM killer. It does not matter if this is a page fault (would still
happen with your patch) or a bufferd read/write (would no longer
happen).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/