Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly causedby netem)

From: Andres Freund
Date: Thu Jul 02 2009 - 07:12:24 EST


On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
...
So I tried - and I did not catch any lockdep output before the crash.
Unfortunately I do not have another machine on the same local network to
catch any messages after the crash... So I could be missing some warning
(I did synchronous logging though).
Will check with netconsole tomorrow.

Could you try if this patch changes anything?

...and maybe CONFIG_PACKET_MMAP turned off.
Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly the same game.
I now had another computer to catch the netconsole output. Still no lockdep warnings.

Unfortunately the other computer was a windows machine with its strange terminal, so long lines are wrapped at 80cols, but that shouldn't be too bad.

Andres

[ 215.208044] netem: version 1.2

[ 350.040136] BUG: soft lockup - CPU#1 stuck for 61s! [openvpn:4248]
[ 350.040136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[ 350.040136] irq event stamp: 149925
[ 350.040136] hardirqs last enabled at (149924): [<ffffffff81036a10>] restore_
args+0x0/0x30
[ 350.040136] hardirqs last disabled at (149925): [<ffffffff81035d3a>] save_arg
s+0x6a/0x70
[ 350.040136] softirqs last enabled at (19946): [<ffffffff815528ad>] lock_sock
_nested+0x8d/0x130
[ 350.040136] softirqs last disabled at (19952): [<ffffffff815627a8>] dev_queue
_xmit+0x58/0x4b0
[ 350.040136] CPU 1:
[ 350.040136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[ 350.040136] Pid: 4248, comm: openvpn Not tainted 2.6.31-rc1-andres-00437-gde7
327a-dirty #61 208252G
[ 350.040136] RIP: 0010:[<ffffffff8103e276>] [<ffffffff8103e276>] native_read_
tsc+0x6/0x20
[ 350.040136] RSP: 0018:ffff8801271c78b8 EFLAGS: 00000206
[ 350.040136] RAX: 000000000ae20bd8 RBX: ffff8801271c78b8 RCX: 000000000ae20b00

[ 350.040136] RDX: 00000000000000b0 RSI: 0000000000006040 RDI: 0000000000000001

[ 350.040136] RBP: ffffffff81036b6e R08: ffffffff82175180 R09: 0000000000000000

[ 350.040136] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271c6000

[ 350.040136] R13: 0000000000000000 R14: ffff88002efec400 R15: 0000000000000000

[ 350.040136] FS: 00007f98ddc646f0(0000) GS:ffff88002efde000(0000) knlGS:00000
00000000000
[ 350.040136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 350.040136] CR2: 0000000005593008 CR3: 000000012742f000 CR4: 00000000000026e0

[ 350.040136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 350.040136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[ 350.040136] Call Trace:
[ 350.040136] [<ffffffff812a96c2>] ? delay_tsc+0x22/0x80
[ 350.040136] [<ffffffff812a95da>] ? __delay+0xa/0x10
[ 350.040136] [<ffffffff812addbd>] ? _raw_spin_lock+0xfd/0x170
[ 350.040136] [<ffffffff816e92f1>] ? _spin_lock+0x51/0x70
[ 350.040136] [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[ 350.040136] [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[ 350.040136] [<ffffffff815627a3>] ? dev_queue_xmit+0x53/0x4b0
[ 350.040136] [<ffffffff81594bac>] ? ip_finish_output+0x13c/0x320
[ 350.040136] [<ffffffff81594e0b>] ? ip_output+0x7b/0xd0
[ 350.040136] [<ffffffff81593bf0>] ? ip_local_out+0x20/0x30
[ 350.040136] [<ffffffff815943c5>] ? ip_queue_xmit+0x165/0x3b0
[ 350.040136] [<ffffffff815a8d49>] ? tcp_transmit_skb+0x3e9/0x780
[ 350.040136] [<ffffffff815ab3b7>] ? tcp_write_xmit+0x1e7/0x9d0
[ 350.040136] [<ffffffff815abc0b>] ? __tcp_push_pending_frames+0x2b/0xd0
[ 350.040136] [<ffffffff8159e327>] ? tcp_sendmsg+0x887/0xb90
[ 350.040136] [<ffffffff8154fa86>] ? sock_sendmsg+0x126/0x140
[ 350.040136] [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[ 350.040136] [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[ 350.040136] [<ffffffff810ab0e7>] ? mark_held_locks+0x67/0x90
[ 350.040136] [<ffffffff816e90bb>] ? _spin_unlock_irqrestore+0x3b/0x70
[ 350.040136] [<ffffffff815509c0>] ? sys_sendto+0xf0/0x130
[ 350.040136] [<ffffffff810ab3fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[ 350.040136] [<ffffffff810ab44d>] ? trace_hardirqs_on+0xd/0x10
[ 350.040136] [<ffffffff810a1c27>] ? getnstimeofday+0x57/0xe0
[ 350.040136] [<ffffffff8109bbf1>] ? ktime_get_ts+0x51/0x70
[ 350.040136] [<ffffffff81035ec2>] ? system_call_fastpath+0x16/0x1b
[ 415.538136] BUG: soft lockup - CPU#1 stuck for 61s! [openvpn:4248]
[ 415.538136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[ 415.538136] irq event stamp: 281051
[ 415.538136] hardirqs last enabled at (281050): [<ffffffff81036a10>] restore_
args+0x0/0x30
[ 415.538136] hardirqs last disabled at (281051): [<ffffffff81035d3a>] save_arg
s+0x6a/0x70
[ 415.538136] softirqs last enabled at (19946): [<ffffffff815528ad>] lock_sock
_nested+0x8d/0x130
[ 415.538136] softirqs last disabled at (19952): [<ffffffff815627a8>] dev_queue
_xmit+0x58/0x4b0
[ 415.538136] CPU 1:
[ 415.538136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[ 415.538136] Pid: 4248, comm: openvpn Not tainted 2.6.31-rc1-andres-00437-gde7
327a-dirty #61 208252G
[ 415.538136] RIP: 0010:[<ffffffff8103e276>] [<ffffffff8103e276>] native_read_
tsc+0x6/0x20
[ 415.538136] RSP: 0018:ffff8801271c78b8 EFLAGS: 00000206
[ 415.538136] RAX: 000000008cf9059c RBX: ffff8801271c78b8 RCX: 000000008cf9050c

[ 415.538136] RDX: 00000000000000d4 RSI: 0000000000006040 RDI: 0000000000000001

[ 415.538136] RBP: ffffffff81036b6e R08: ffffffff82175180 R09: 0000000000000000

[ 415.538136] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000003fed

[ 415.538136] R13: ffff88002efde000 R14: ffff8801271c6000 R15: 0000000000000000

[ 415.538136] FS: 00007f98ddc646f0(0000) GS:ffff88002efde000(0000) knlGS:00000
00000000000
[ 415.538136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 415.538136] CR2: 0000000005593008 CR3: 000000012742f000 CR4: 00000000000026e0

[ 415.538136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 415.538136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[ 415.538136] Call Trace:
[ 415.538136] [<ffffffff812a96ea>] ? delay_tsc+0x4a/0x80
[ 415.538136] [<ffffffff812a95da>] ? __delay+0xa/0x10
[ 415.538136] [<ffffffff812addbd>] ? _raw_spin_lock+0xfd/0x170
[ 415.538136] [<ffffffff816e92f1>] ? _spin_lock+0x51/0x70
[ 415.538136] [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[ 415.538136] [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[ 415.538136] [<ffffffff815627a3>] ? dev_queue_xmit+0x53/0x4b0
[ 415.538136] [<ffffffff81594bac>] ? ip_finish_output+0x13c/0x320
[ 415.538136] [<ffffffff81594e0b>] ? ip_output+0x7b/0xd0
[ 415.538136] [<ffffffff81593bf0>] ? ip_local_out+0x20/0x30
[ 415.538136] [<ffffffff815943c5>] ? ip_queue_xmit+0x165/0x3b0
[ 415.538136] [<ffffffff815a8d49>] ? tcp_transmit_skb+0x3e9/0x780
[ 415.538136] [<ffffffff815ab3b7>] ? tcp_write_xmit+0x1e7/0x9d0
[ 415.538136] [<ffffffff815abc0b>] ? __tcp_push_pending_frames+0x2b/0xd0
[ 415.538136] [<ffffffff8159e327>] ? tcp_sendmsg+0x887/0xb90
[ 415.538136] [<ffffffff8154fa86>] ? sock_sendmsg+0x126/0x140
[ 415.538136] [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[ 415.538136] [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[ 415.538136] [<ffffffff810ab0e7>] ? mark_held_locks+0x67/0x90
[ 415.538136] [<ffffffff816e90bb>] ? _spin_unlock_irqrestore+0x3b/0x70
[ 415.538136] [<ffffffff815509c0>] ? sys_sendto+0xf0/0x130
[ 415.538136] [<ffffffff810ab3fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[ 415.538136] [<ffffffff810ab44d>] ? trace_hardirqs_on+0xd/0x10
[ 415.538136] [<ffffffff810a1c27>] ? getnstimeofday+0x57/0xe0
[ 415.538136] [<ffffffff8109bbf1>] ? ktime_get_ts+0x51/0x70
[ 415.538136] [<ffffffff81035ec2>] ? system_call_fastpath+0x16/0x1b