Re: Strange freeze on 2.6.22 (deadlock?)
From: Randy Dunlap
Date: Mon Jan 07 2008 - 12:22:41 EST
On Mon, 07 Jan 2008 16:48:54 +0100 Brice Figureau wrote:
> Hi,
>
> I'm seeing a strange complete server freeze/lock-up on an bi-Xeon HT
> amd64 server running standard debian 2.6.22 (and before that vanilla
> 2.6.19.x and 2.6.20.x which exhibited the same issue).
>
> I'm only reporting it now, since I could get a full sysrq-t only this
> morning.
>
> The symptoms are that every 5 to 7 days, the server (which acts as a MX
> along with a few low traffic websites) locks-up. The ipmi watchdog is
> unable to reboot the server (and doesn't even trigger, since there is no
> evidence in the esmlog), the machine is still pingable. I can't ssh to
> it, but I can enter my login & password on a serial console, but no
> shell is started.
>
> Pressing sysrq-t produced the trace hosted here:
> http://www.daysofwonder.com/bug/crash-server1.txt.gz
>
> It happened one time when I was connected to the server through ssh and
> I could see that the load started to increase well above 100. It was
> then impossible to launch new process from the command-line (and I had
> to reboot manually).
> It happened also last week, and the server was stuck for about 6 hours.
> When I started investigating what was wrong, it slowly came back to life
> (with an avg 1-min load of more than 1500, and tons of cron processes
> running in parallel).
>
> I'm not really familiar with kernel development so I can't really find
> the issue in the aforementioned trace output.
> What I think is that for some reason there is a race/deadlock that
> finally prevents new processes to really start (which in turns produces
> the high load).
>
> What seems suspect in the aforementioned trace is:
> *) lot of processes stacktrace ends in __mod_timer+0xc3/0xd3
> which seems to be this line from kernel/timer.c
>
> 415 timer->expires = expires;
> 416 internal_add_timer(base, timer);
> --> spin_unlock_irqrestore(&base->lock, flags);
>
> 419 return ret;
> 420 }
>
> *) lot of processes stacktrace ends in __mutex_lock_slowpath and/or zone_statistics
There are also lots of processes in D state (usually waiting
for I/O to complete). And jbd is in their stack traces.
How is/are the ext3 filesystems mounted? I mean what data=xyz
mode? data=journal (the heaviest duty mode) has at least one
known deadlock. If you are using data=journal, you could try
switching to data=ordered...
> Anyway, I will soon reboot to a 2.6.23.x to see if that symptom
> persists.
> More information (config, server specs) are available on request.
>
> I'm not subscribed to the list, so please CC: me for any anwser.
> Many thanks,
---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/