Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

From: azurIt
Date: Sun Jul 14 2013 - 13:07:38 EST


> CC: "Michal Hocko" <mhocko@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, "cgroups mailinglist" <cgroups@xxxxxxxxxxxxxxx>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >I looked at your debug messages but could not find anything that would
>> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
>> >assume you use the freezer cgroup and enabled it somehow?
>>
>>
>> Yes, I really am using the freezer cgroup, BUT I was checking whether it
>> was causing problems. Unfortunately, several days have passed since then
>> and I no longer fully remember whether I checked it in both cases
>> (unremovable cgroups and these frozen processes holding the web
>> server port). I'm 100% sure I checked it for the unremovable
>> cgroups but not so sure about the other problem (I had to act quickly
>> in that case). Are you sure (from the stacks) that the freezer cgroup was
>> enabled there?
>
>Yeah, all the traces without exception look like this:
>
>1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>
>so the freezer was already enabled when you took the backtraces.
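For anyone who wants to repeat this check on a larger dump, here is a minimal sketch (not part of the original exchange) of how grep output in the form shown above could be scanned for tasks whose innermost frame is refrigerator(). It assumes every line follows the `timestamp/pid/stack:[<address>] symbol+offset/size` layout of the traces quoted above; the names `top_frames` and `frozen_pids` are hypothetical helpers invented for this illustration.

```python
import re

# Assumed line layout, taken from the traces quoted above:
#   1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
FRAME_RE = re.compile(r"^(\d+)/(\d+)/stack:\[<[0-9a-f]+>\]\s+(\S+)")


def top_frames(lines):
    """Map (timestamp, pid) to the symbol of the first frame seen for it."""
    top = {}
    for line in lines:
        # Tolerate mail quoting such as a leading ">" before the trace.
        match = FRAME_RE.match(line.lstrip("> ").strip())
        if not match:
            continue
        timestamp, pid, frame = match.groups()
        symbol = frame.split("+", 1)[0]  # drop the "+offset/size" suffix
        # Only the first frame per sample is kept; later frames are callers.
        top.setdefault((timestamp, pid), symbol)
    return top


def frozen_pids(lines):
    """Return the sorted PIDs whose innermost frame is refrigerator()."""
    return sorted(
        {pid for (_, pid), sym in top_frames(lines).items() if sym == "refrigerator"}
    )
```

Feeding it the saved grep output line by line (for example `frozen_pids(open(path))`, with `path` being wherever the dump was stored) would list the PIDs that were sitting in the refrigerator when the sample was taken. It says nothing about why they were frozen.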
>
>> Btw, what about those other stacks? I mean this file:
>> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>>
>> It was taken while running the kernel with your patch and from a
>> cgroup which was under unresolvable OOM (just like my very original
>> problem).
>
>I looked at these traces too, but none of the tasks are stuck in rmdir
>or the OOM path. Some /are/ in the page fault path, but they are
>happily doing reclaim and don't appear to be stuck. So I'm having a
>hard time matching this data to what you otherwise observed.
>
>However, based on what you reported the most likely explanation for
>the continued hangs is the unfinished OOM handling for which I sent
>the followup patch for arch/x86/mm/fault.c.


Johannes,

this problem happened again, and this time it was even worse. Now I'm sure it wasn't my fault. This time I wasn't even able to access /proc/<pid> of the hung apache process (which was, again, holding the web server port and forced me to reboot the server). Everything that tried to access /proc/<pid> simply hung. The server wasn't even able to reboot cleanly: it hung and then did a hard reboot after a few minutes.

azur