[regression?] hang with 2.6.37 on a BTRFS test machine

From: Martin Steigerwald
Date: Sun Jan 23 2011 - 07:07:21 EST


Not reported on bugzilla.kernel.org yet as its so vague.


Hi!

I have something I do not have the slightest clue on how to track down:

Occasionally with 2.6.37 plain vanilla my ThinkPad T23 freezes in such a
way, that when I type a command at the prompt, nothing will happen
anymore. Same goes for login into a tty. Even dmesg is not executed
anymore. My KDE session doesn't react to clicks anymore, except for window
movement. Most of the time the mouse pointer is still movable, but
sometimes it is frozen as well. But the latter might be a different problem
with X.org.

My theory is that at least I/O related kernel functions do not return
anymore at that time.

Nothing is in the logs after a reboot.

Problem now is: This happens rather irregularily. That kernel run for days
without problems. But today two times it hung after the first time it hung.
One time it was on TTY 1 instead of in X.org. So when this happens it
seems to persist a while.

But I now booted with 2.6.36 and the machine is as stable as if nothing
has happened at all.

OTOH a 2.6.37 with recursive mtime patches from Jan Kara - that hung on
the T23 as well - runs just fine on my T42 for nine days (minus hibernation
times) already.

Now it could be something hardware related, the T23 is different from the
T42. Also the T23 has a fan that should be replaced or oiled - but it
still works right up to now. Also with 2.6.36 or earlier kernels the
machine never hung that way so far.

Or it could be something BTRFS related. The T23 is using BTRFS on /home
for months already and for / since beginning this year or end of last year
while the T42 is still with Ext4 - until there is a robust fsck for BTRFS
and BTRFS is marked stable in the kernel at last.

A bisection is too much effort for me - especially as it could take days
till the problems shows up. Some sysrq stuff doesn't work without serial or
netconsole since a call to dmesg does not return anymore when it happens.
Having netconsole and serial running for days for just in case it happens
is also quite some effort, since I have to remember to reinitialize it
after each hibernation cycle to have a permanent setup.

Since I have no other idea on how to track this with reasonable effort - I
spent tons of time with trying to help debugging various, partly
intermittent, kernel problems in the last months and I have enough of that
ATM - I will stick to 2.6.36 for the time being and see whether the T23 is
really stable with that one. I still wanted to report my - admittedly
vague - findings in case it rings a bell for someone. I will monitor
2.6.37.y updates in case they contain BTRFS and or T23 hardware related
fixes.

Last but not least: I have no seen BTRFS related errors or other oddities
in syslog. The following looks quite normal to me, I did a btrfs filesystem
balance some time ago and had some hard switch off cold reboots due to the
issue mentioned:

artin@deepdance:~> zgrep -i btrfs /var/log/syslog*
/var/log/syslog.1:Jan 17 20:43:06 localhost kernel: Btrfs loaded
/var/log/syslog.1:Jan 23 12:01:12 localhost kernel: Btrfs loaded
/var/log/syslog.1:Jan 23 12:01:12 localhost kernel: btrfs: unlinked 3
orphans
/var/log/syslog.1:Jan 23 12:09:36 localhost kernel: Btrfs loaded
/var/log/syslog.1:Jan 23 12:09:36 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.1:Jan 23 12:20:52 localhost kernel: Btrfs loaded
/var/log/syslog.1:Jan 23 12:20:52 localhost kernel: btrfs: unlinked 18
orphans
/var/log/syslog.2.gz:Jan 14 12:32:16 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 14 12:32:16 localhost kernel: btrfs: unlinked 116
orphans
/var/log/syslog.2.gz:Jan 14 12:40:02 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 14 14:44:28 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 18:14:06 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 18:14:06 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.2.gz:Jan 16 18:26:46 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 18:26:46 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.2.gz:Jan 16 18:41:11 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 18:41:11 localhost kernel: btrfs: unlinked 3
orphans
/var/log/syslog.2.gz:Jan 16 18:53:47 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 18:53:47 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.2.gz:Jan 16 19:00:14 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 19:00:14 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.2.gz:Jan 16 19:05:53 localhost kernel: Btrfs loaded
/var/log/syslog.2.gz:Jan 16 19:05:54 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.7.gz:Dec 9 19:05:38 localhost kernel: Btrfs loaded
/var/log/syslog.7.gz:Dec 9 19:05:38 localhost kernel: btrfs: unlinked 2
orphans
/var/log/syslog.7.gz:Dec 9 19:05:38 localhost kernel: btrfs: truncated 1
orphans
/var/log/syslog.7.gz:Dec 9 20:54:10 localhost kernel: btrfs: relocating
block group 10900996096 flags 36
/var/log/syslog.7.gz:Dec 9 20:55:02 localhost kernel: btrfs: found 2036
extents
/var/log/syslog.7.gz:Dec 9 20:55:02 localhost kernel: btrfs: relocating
block group 10766778368 flags 36
/var/log/syslog.7.gz:Dec 9 20:56:15 localhost kernel: btrfs: found 10432
extents
/var/log/syslog.7.gz:Dec 9 20:56:16 localhost kernel: btrfs: relocating
block group 9693036544 flags 1
/var/log/syslog.7.gz:Dec 9 20:56:26 localhost kernel: btrfs: found 155
extents
/var/log/syslog.7.gz:Dec 9 20:56:31 localhost kernel: btrfs: found 154
extents
/var/log/syslog.7.gz:Dec 9 20:56:31 localhost kernel: btrfs: relocating
block group 8619294720 flags 1
/var/log/syslog.7.gz:Dec 9 20:57:37 localhost kernel: btrfs: found 5027
extents
/var/log/syslog.7.gz:Dec 9 20:57:49 localhost kernel: btrfs: found 5027
extents
/var/log/syslog.7.gz:Dec 9 20:57:51 localhost kernel: btrfs: relocating
block group 7545552896 flags 1
/var/log/syslog.7.gz:Dec 9 20:59:48 localhost kernel: btrfs: found 12851
extents
/var/log/syslog.7.gz:Dec 9 21:00:28 localhost kernel: btrfs: found 12846
extents
/var/log/syslog.7.gz:Dec 9 21:00:29 localhost kernel: btrfs: relocating
block group 6471811072 flags 1
/var/log/syslog.7.gz:Dec 9 21:01:59 localhost kernel: btrfs: found 46780
extents
/var/log/syslog.7.gz:Dec 9 21:03:21 localhost kernel: btrfs: found 46780
extents
/var/log/syslog.7.gz:Dec 9 21:03:26 localhost kernel: btrfs: relocating
block group 5398069248 flags 1
/var/log/syslog.7.gz:Dec 9 21:04:34 localhost kernel: btrfs: found 19425
extents
/var/log/syslog.7.gz:Dec 9 21:05:15 localhost kernel: btrfs: found 19425
extents
/var/log/syslog.7.gz:Dec 9 21:05:16 localhost kernel: btrfs: relocating
block group 4324327424 flags 1
/var/log/syslog.7.gz:Dec 9 21:06:33 localhost kernel: btrfs: found 26059
extents
/var/log/syslog.7.gz:Dec 9 21:07:22 localhost kernel: btrfs: found 26059
extents
/var/log/syslog.7.gz:Dec 9 21:07:25 localhost kernel: btrfs: relocating
block group 3250585600 flags 1
/var/log/syslog.7.gz:Dec 9 21:08:47 localhost kernel: btrfs: found 15614
extents
/var/log/syslog.7.gz:Dec 9 21:09:29 localhost kernel: btrfs: found 15614
extents
/var/log/syslog.7.gz:Dec 9 21:09:31 localhost kernel: btrfs: relocating
block group 2176843776 flags 1
/var/log/syslog.7.gz:Dec 9 21:10:58 localhost kernel: btrfs: found 4275
extents
/var/log/syslog.7.gz:Dec 9 21:11:34 localhost kernel: btrfs: found 4275
extents
/var/log/syslog.7.gz:Dec 9 21:11:35 localhost kernel: btrfs: relocating
block group 1103101952 flags 1
/var/log/syslog.7.gz:Dec 9 21:13:01 localhost kernel: btrfs: found 7296
extents
/var/log/syslog.7.gz:Dec 9 21:13:35 localhost kernel: btrfs: found 7296
extents
/var/log/syslog.7.gz:Dec 9 21:13:36 localhost kernel: btrfs: relocating
block group 29360128 flags 36
/var/log/syslog.7.gz:Dec 9 21:16:53 localhost kernel: btrfs: found 78308
extents
/var/log/syslog.7.gz:Dec 9 21:16:54 localhost kernel: btrfs: relocating
block group 20971520 flags 34
/var/log/syslog.7.gz:Dec 9 21:16:54 localhost kernel: btrfs: found 1
extents
/var/log/syslog.7.gz:Dec 9 21:16:56 localhost kernel: btrfs: relocating
block group 12582912 flags 1
/var/log/syslog.7.gz:Dec 9 21:16:57 localhost kernel: btrfs: found 85
extents
/var/log/syslog.7.gz:Dec 9 21:16:59 localhost kernel: btrfs: found 85
extents
/var/log/syslog.7.gz:Dec 9 21:17:01 localhost kernel: btrfs: relocating
block group 4194304 flags 4

Hmmm, one thing there is:

Jan 13 13:27:15 localhost kernel: INFO: task dpkg:21162 blocked for more
than 120 seconds.
Jan 13 13:27:15 localhost kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 13:27:15 localhost kernel: dpkg D ef891b0c 0 21162
2334 0x00000000
Jan 13 13:27:15 localhost kernel: d2e4be7c 00000082 ef8918cc ef891b0c
ef8918a0 ef891b0c ef891b10 1b8b88c5
Jan 13 13:27:15 localhost kernel: 00018796 ef8918a0 7fffffff d2e4bf4c 7fffffff
d2e4bee4 c13c4495 cb7862ac
Jan 13 13:27:15 localhost kernel: d2e4be98 c102ab56 cb7862ac c1514fa0
d2e4beb4 c102d740 ee303140 d2e4bec0
Jan 13 13:27:15 localhost kernel: Call Trace:
Jan 13 13:27:15 localhost kernel: [<c13c4495>]
schedule_timeout+0x175/0x280
Jan 13 13:27:15 localhost kernel: [<c102ab56>] ?
__dequeue_entity+0x26/0x50
Jan 13 13:27:15 localhost kernel: [<c102d740>] ? set_next_entity+0xc0/0xe0
Jan 13 13:27:15 localhost kernel: [<c10017b3>] ? __switch_to+0xa3/0x160
Jan 13 13:27:15 localhost kernel: [<c1030132>] ? T.1277+0x32/0xa0
Jan 13 13:27:15 localhost kernel: [<c13c394c>] ? schedule+0x1fc/0x5d0
Jan 13 13:27:15 localhost kernel: [<c13c4006>] wait_for_common+0xa6/0x160
Jan 13 13:27:15 localhost kernel: [<c10316a0>] ?
default_wake_function+0x0/0x20
Jan 13 13:27:15 localhost kernel: [<c13c4197>]
wait_for_completion+0x17/0x20
Jan 13 13:27:15 localhost kernel: [<c10f8d52>] sync_inodes_sb+0x92/0x1a0
Jan 13 13:27:15 localhost kernel: [<c111f230>] ?
dquot_quota_sync+0x0/0x310
Jan 13 13:27:15 localhost kernel: [<c10fd4f7>] __sync_filesystem+0x87/0xa0
Jan 13 13:27:15 localhost kernel: [<c10fd52f>] sync_one_sb+0x1f/0x30
Jan 13 13:27:15 localhost kernel: [<c10d9e6b>] iterate_supers+0x7b/0xf0
Jan 13 13:27:15 localhost kernel: [<c10fd510>] ? sync_one_sb+0x0/0x30
Jan 13 13:27:15 localhost kernel: [<c10fd57c>] sys_sync+0x3c/0x60
Jan 13 13:27:15 localhost kernel: [<c1002f23>] sysenter_do_call+0x12/0x2c
Jan 13 13:33:38 localhost init: Trying to re-exec init

There is only one occurance of it:

martin@deepdance:~> zgrep -i "blocked" /var/log/syslog*
/var/log/syslog.2.gz:Jan 13 13:27:15 localhost kernel: INFO: task
dpkg:21162 blocked for more than 120 seconds.
martin@deepdance:~#1>

I put the logs aside in case anyone wants to have a closer look.

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

Attachment: signature.asc
Description: This is a digitally signed message part.