4.1.16 crash: list_add corruption

From: Philipp Hahn
Date: Wed Sep 28 2016 - 13:10:51 EST


Hello,
one of our servers crashed repeatedly this week. After setting up serial
console logging we were able to capture the following stack traces:

> [3689736.061539] WARNING: CPU: 0 PID: 29284 at linux-4.1.6/lib/list_debug.c:33 __list_add+0xc0/0xd0()
> [3689736.061541] list_add corruption. prev->next should be next (ffffffff81ab3ca8), but was ffffffff81ab3cc8. (prev=ffff8804d9910d58).

Compare this ...

> [3689736.061602] CPU: 0 PID: 29284 Comm: slapd Tainted: G W 4.1.0-ucs190-amd64 #1 Debian 4.1.6-1.190.201604142226
> [3689736.061603] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015

Maybe VMware has a bug?

> [3689736.061604] 0000000000000000 ffffffff817531c0 ffffffff81597807 ffff88083fc038a8
> [3689736.061606] ffffffff81076c45 ffff88004b553e00 ffffffff81ab3ca8 ffff8804d9910d58
> [3689736.061608] 0000000000000001 0000000137090762 ffffffff81076d4a ffffffff81753310
> [3689736.061609] Call Trace:
...
> [3689736.061624] [<ffffffff8130be50>] ? __list_add+0xc0/0xd0
> [3689736.061627] [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
> [3689736.061629] [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140
> [3689736.061635] [<ffffffffa048c441>] ? __nf_ct_refresh_acct+0xb1/0xc0 [nf_conntrack]
> [3689736.061640] [<ffffffffa04945bc>] ? tcp_packet+0x66c/0x1500 [nf_conntrack]
> [3689736.061643] [<ffffffff810b5fff>] ? autoremove_wake_function+0x2f/0x50
> [3689736.061647] [<ffffffffa0493ef2>] ? tcp_error+0x1b2/0x210 [nf_conntrack]
> [3689736.061650] [<ffffffffa048e725>] ? nf_conntrack_in+0x3a5/0xb30 [nf_conntrack]
> [3689736.061654] [<ffffffff81481cb4>] ? sk_reset_timer+0x14/0x20
> [3689736.061657] [<ffffffff814cdeef>] ? nf_iterate+0x4f/0x80
> [3689736.061659] [<ffffffff814cdfb8>] ? nf_hook_slow+0x98/0xf0
> [3689736.061662] [<ffffffff814d52f4>] ? ip_rcv+0x314/0x400
> [3689736.061664] [<ffffffff814d48a0>] ? inet_add_protocol+0x50/0x50
> [3689736.061668] [<ffffffff81498ae3>] ? __netif_receive_skb_core+0x703/0x920
> [3689736.061670] [<ffffffff8101f405>] ? read_tsc+0x5/0x10
> [3689736.061672] [<ffffffff81498ecf>] ? netif_receive_skb_internal+0x1f/0x90
> [3689736.061673] [<ffffffff81499af0>] ? napi_gro_receive+0xb0/0xe0
> [3689736.061678] [<ffffffffa0097fe4>] ? e1000_clean_rx_irq+0x2b4/0x500 [e1000]
> [3689736.061681] [<ffffffffa0099ccc>] ? e1000_clean+0x26c/0x900 [e1000]
> [3689736.061683] [<ffffffff81499629>] ? net_rx_action+0x159/0x330
> [3689736.061685] [<ffffffff8107aace>] ? __do_softirq+0xde/0x260
> [3689736.061687] [<ffffffff8107ae95>] ? irq_exit+0x95/0xa0
> [3689736.061689] [<ffffffff815a0b74>] ? do_IRQ+0x64/0x110
> [3689736.061691] [<ffffffff8159e9ee>] ? common_interrupt+0x6e/0x6e
...
> [3689738.157677] WARNING: CPU: 0 PID: 29284 at linux-4.1.6/lib/list_debug.c:33 __list_add+0xc0/0xd0()
> [3689738.157678] list_add corruption. prev->next should be next (ffffffff81ab3cc8), but was ffffffff81ab3ca8. (prev=ffff8804d9910d58).

with that one: the expected and the actual value of prev->next are swapped, while prev is the same entry in both warnings.
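
The comparison of the two warnings can also be checked mechanically. The following is only a minimal sketch: the helper name parse_warning and the regular expression are illustrative assumptions and not part of any kernel tooling; the two message strings are copied from the warnings quoted above.

```python
import re

# The two warning messages quoted above, copied from the report.
WARNINGS = [
    "list_add corruption. prev->next should be next (ffffffff81ab3ca8), "
    "but was ffffffff81ab3cc8. (prev=ffff8804d9910d58).",
    "list_add corruption. prev->next should be next (ffffffff81ab3cc8), "
    "but was ffffffff81ab3ca8. (prev=ffff8804d9910d58).",
]

# Hypothetical pattern for this one message format only.
PATTERN = re.compile(
    r"prev->next should be next \((?P<expected>[0-9a-f]+)\), "
    r"but was (?P<actual>[0-9a-f]+)\. \(prev=(?P<prev>[0-9a-f]+)\)"
)


def parse_warning(line):
    """Return the expected, actual and prev pointers of one warning as ints."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a list_add corruption warning: %r" % line)
    return {name: int(value, 16) for name, value in match.groupdict().items()}


first, second = (parse_warning(line) for line in WARNINGS)

# Both warnings refer to the same prev entry ...
same_prev = first["prev"] == second["prev"]
# ... but the expected and actual values of prev->next trade places.
swapped = (first["expected"] == second["actual"]
           and first["actual"] == second["expected"])
# Byte distance between the two pointers that prev->next alternates between.
distance = abs(first["expected"] - first["actual"])

print("same prev entry:", same_prev)
print("expected/actual swapped:", swapped)
print("distance between the two next pointers: %#x bytes" % distance)
```

For the two warnings above this reports the same prev entry, swapped values, and a distance of 0x20 bytes between the two pointers.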

...
> [3689738.157740] [<ffffffff8130be50>] ? __list_add+0xc0/0xd0
> [3689738.157742] [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
> [3689738.157744] [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140
> [3689738.157748] [<ffffffffa048c441>] ? __nf_ct_refresh_acct+0xb1/0xc0 [nf_conntrack]
> [3689738.157751] [<ffffffffa04945bc>] ? tcp_packet+0x66c/0x1500 [nf_conntrack]
> [3689738.157753] [<ffffffff8101f9d5>] ? sched_clock+0x5/0x10
> [3689738.157755] [<ffffffff8109ea48>] ? resched_curr+0x38/0xc0
> [3689738.157758] [<ffffffff810b5fff>] ? autoremove_wake_function+0x2f/0x50
> [3689738.157760] [<ffffffffa0493ef2>] ? tcp_error+0x1b2/0x210 [nf_conntrack]
> [3689738.157763] [<ffffffffa048e725>] ? nf_conntrack_in+0x3a5/0xb30 [nf_conntrack]
> [3689738.157765] [<ffffffff81481cb4>] ? sk_reset_timer+0x14/0x20
> [3689738.157768] [<ffffffff814cdeef>] ? nf_iterate+0x4f/0x80
> [3689738.157769] [<ffffffff814cdfb8>] ? nf_hook_slow+0x98/0xf0
> [3689738.157771] [<ffffffff814d52f4>] ? ip_rcv+0x314/0x400
> [3689738.157773] [<ffffffff814d48a0>] ? inet_add_protocol+0x50/0x50
> [3689738.157775] [<ffffffff81498ae3>] ? __netif_receive_skb_core+0x703/0x920
> [3689738.157777] [<ffffffff8101f405>] ? read_tsc+0x5/0x10
> [3689738.157778] [<ffffffff81498ecf>] ? netif_receive_skb_internal+0x1f/0x90
> [3689738.157780] [<ffffffff81499af0>] ? napi_gro_receive+0xb0/0xe0
> [3689738.157784] [<ffffffffa0097fe4>] ? e1000_clean_rx_irq+0x2b4/0x500 [e1000]
> [3689738.157787] [<ffffffffa0099ccc>] ? e1000_clean+0x26c/0x900 [e1000]
> [3689738.157789] [<ffffffff81499629>] ? net_rx_action+0x159/0x330
> [3689738.157791] [<ffffffff8107aace>] ? __do_softirq+0xde/0x260
> [3689738.157792] [<ffffffff8107ae95>] ? irq_exit+0x95/0xa0
> [3689738.157794] [<ffffffff815a0b74>] ? do_IRQ+0x64/0x110
> [3689738.157797] [<ffffffff8159e9ee>] ? common_interrupt+0x6e/0x6e

Has anyone seen a similar issue, or does anyone know whether it is fixed post 4.1.16?

If you need more data, just ask and I will see what else I can gather.

Thank you in advance.

Philipp
--
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@xxxxxxxxxxxxx

http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876
[3689736.061530] ------------[ cut here ]------------
[3689736.061539] WARNING: CPU: 0 PID: 29284 at linux-4.1.6/lib/list_debug.c:33 __list_add+0xc0/0xd0()
[3689736.061541] list_add corruption. prev->next should be next (ffffffff81ab3ca8), but was ffffffff81ab3cc8. (prev=ffff8804d9910d58).
[3689736.061542] Modules linked in: nfnetlink_log nfnetlink xt_addrtype xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 bridge stp llc overlay vmw_vsock_vmci_transport vsock ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc quota_v2 quota_tree vmw_balloon psmouse coretemp pcspkr serio_raw parport_pc 8250_fintek parport shpchp i2c_piix4 vmw_vmci ac acpi_cpufreq processor thermal_sys battery evdev ext4 crc16 mbcache jbd2 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sg sr_mod cdrom sd_mod ata_generic crc32c_intel e1000 floppy vmwgfx ttm ata_piix mptspi scsi_transport_spi mptscsih mptbase libata drm_kms_helper drm scsi_mod button
[3689736.061602] CPU: 0 PID: 29284 Comm: slapd Tainted: G W 4.1.0-ucs190-amd64 #1 Debian 4.1.6-1.190.201604142226
[3689736.061603] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[3689736.061604] 0000000000000000 ffffffff817531c0 ffffffff81597807 ffff88083fc038a8
[3689736.061606] ffffffff81076c45 ffff88004b553e00 ffffffff81ab3ca8 ffff8804d9910d58
[3689736.061608] 0000000000000001 0000000137090762 ffffffff81076d4a ffffffff81753310
[3689736.061609] Call Trace:
[3689736.061610] <IRQ> [<ffffffff81597807>] ? dump_stack+0x40/0x50
[3689736.061619] [<ffffffff81076c45>] ? warn_slowpath_common+0x95/0xe0
[3689736.061621] [<ffffffff81076d4a>] ? warn_slowpath_fmt+0x4a/0x50
[3689736.061624] [<ffffffff8130be50>] ? __list_add+0xc0/0xd0
[3689736.061627] [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
[3689736.061629] [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140
[3689736.061635] [<ffffffffa048c441>] ? __nf_ct_refresh_acct+0xb1/0xc0 [nf_conntrack]
[3689736.061640] [<ffffffffa04945bc>] ? tcp_packet+0x66c/0x1500 [nf_conntrack]
[3689736.061643] [<ffffffff810b5fff>] ? autoremove_wake_function+0x2f/0x50
[3689736.061647] [<ffffffffa0493ef2>] ? tcp_error+0x1b2/0x210 [nf_conntrack]
[3689736.061650] [<ffffffffa048e725>] ? nf_conntrack_in+0x3a5/0xb30 [nf_conntrack]
[3689736.061654] [<ffffffff81481cb4>] ? sk_reset_timer+0x14/0x20
[3689736.061657] [<ffffffff814cdeef>] ? nf_iterate+0x4f/0x80
[3689736.061659] [<ffffffff814cdfb8>] ? nf_hook_slow+0x98/0xf0
[3689736.061662] [<ffffffff814d52f4>] ? ip_rcv+0x314/0x400
[3689736.061664] [<ffffffff814d48a0>] ? inet_add_protocol+0x50/0x50
[3689736.061668] [<ffffffff81498ae3>] ? __netif_receive_skb_core+0x703/0x920
[3689736.061670] [<ffffffff8101f405>] ? read_tsc+0x5/0x10
[3689736.061672] [<ffffffff81498ecf>] ? netif_receive_skb_internal+0x1f/0x90
[3689736.061673] [<ffffffff81499af0>] ? napi_gro_receive+0xb0/0xe0
[3689736.061678] [<ffffffffa0097fe4>] ? e1000_clean_rx_irq+0x2b4/0x500 [e1000]
[3689736.061681] [<ffffffffa0099ccc>] ? e1000_clean+0x26c/0x900 [e1000]
[3689736.061683] [<ffffffff81499629>] ? net_rx_action+0x159/0x330
[3689736.061685] [<ffffffff8107aace>] ? __do_softirq+0xde/0x260
[3689736.061687] [<ffffffff8107ae95>] ? irq_exit+0x95/0xa0
[3689736.061689] [<ffffffff815a0b74>] ? do_IRQ+0x64/0x110
[3689736.061691] [<ffffffff8159e9ee>] ? common_interrupt+0x6e/0x6e
[3689736.061692] <EOI>
[3689736.061693] ---[ end trace 8364fe1151c67412 ]---
[3689738.157669] ------------[ cut here ]------------
[3689738.157677] WARNING: CPU: 0 PID: 29284 at linux-4.1.6/lib/list_debug.c:33 __list_add+0xc0/0xd0()
[3689738.157678] list_add corruption. prev->next should be next (ffffffff81ab3cc8), but was ffffffff81ab3ca8. (prev=ffff8804d9910d58).
[3689738.157679] Modules linked in: nfnetlink_log nfnetlink xt_addrtype xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 bridge stp llc overlay vmw_vsock_vmci_transport vsock ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc quota_v2 quota_tree vmw_balloon psmouse coretemp pcspkr serio_raw parport_pc 8250_fintek parport shpchp i2c_piix4 vmw_vmci ac acpi_cpufreq processor thermal_sys battery evdev ext4 crc16 mbcache jbd2 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sg sr_mod cdrom sd_mod ata_generic crc32c_intel e1000 floppy vmwgfx ttm ata_piix mptspi scsi_transport_spi mptscsih mptbase libata drm_kms_helper drm scsi_mod button
[3689738.157718] CPU: 0 PID: 29284 Comm: slapd Tainted: G W 4.1.0-ucs190-amd64 #1 Debian 4.1.6-1.190.201604142226
[3689738.157719] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[3689738.157721] 0000000000000000 ffffffff817531c0 ffffffff81597807 ffff88083fc038a8
[3689738.157722] ffffffff81076c45 ffff8807b9f83578 ffffffff81ab3cc8 ffff8804d9910d58
[3689738.157724] 0000000000000001 000000013709096f ffffffff81076d4a ffffffff81753310
[3689738.157725] Call Trace:
[3689738.157726] <IRQ> [<ffffffff81597807>] ? dump_stack+0x40/0x50
[3689738.157734] [<ffffffff81076c45>] ? warn_slowpath_common+0x95/0xe0
[3689738.157735] [<ffffffff81076d4a>] ? warn_slowpath_fmt+0x4a/0x50
[3689738.157738] [<ffffffff810a8962>] ? select_task_rq_fair+0x412/0x610
[3689738.157740] [<ffffffff8130be50>] ? __list_add+0xc0/0xd0
[3689738.157742] [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
[3689738.157744] [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140
[3689738.157748] [<ffffffffa048c441>] ? __nf_ct_refresh_acct+0xb1/0xc0 [nf_conntrack]
[3689738.157751] [<ffffffffa04945bc>] ? tcp_packet+0x66c/0x1500 [nf_conntrack]
[3689738.157753] [<ffffffff8101f9d5>] ? sched_clock+0x5/0x10
[3689738.157755] [<ffffffff8109ea48>] ? resched_curr+0x38/0xc0
[3689738.157758] [<ffffffff810b5fff>] ? autoremove_wake_function+0x2f/0x50
[3689738.157760] [<ffffffffa0493ef2>] ? tcp_error+0x1b2/0x210 [nf_conntrack]
[3689738.157763] [<ffffffffa048e725>] ? nf_conntrack_in+0x3a5/0xb30 [nf_conntrack]
[3689738.157765] [<ffffffff81481cb4>] ? sk_reset_timer+0x14/0x20
[3689738.157768] [<ffffffff814cdeef>] ? nf_iterate+0x4f/0x80
[3689738.157769] [<ffffffff814cdfb8>] ? nf_hook_slow+0x98/0xf0
[3689738.157771] [<ffffffff814d52f4>] ? ip_rcv+0x314/0x400
[3689738.157773] [<ffffffff814d48a0>] ? inet_add_protocol+0x50/0x50
[3689738.157775] [<ffffffff81498ae3>] ? __netif_receive_skb_core+0x703/0x920
[3689738.157777] [<ffffffff8101f405>] ? read_tsc+0x5/0x10
[3689738.157778] [<ffffffff81498ecf>] ? netif_receive_skb_internal+0x1f/0x90
[3689738.157780] [<ffffffff81499af0>] ? napi_gro_receive+0xb0/0xe0
[3689738.157784] [<ffffffffa0097fe4>] ? e1000_clean_rx_irq+0x2b4/0x500 [e1000]
[3689738.157787] [<ffffffffa0099ccc>] ? e1000_clean+0x26c/0x900 [e1000]
[3689738.157789] [<ffffffff81499629>] ? net_rx_action+0x159/0x330
[3689738.157791] [<ffffffff8107aace>] ? __do_softirq+0xde/0x260
[3689738.157792] [<ffffffff8107ae95>] ? irq_exit+0x95/0xa0
[3689738.157794] [<ffffffff815a0b74>] ? do_IRQ+0x64/0x110
[3689738.157797] [<ffffffff8159e9ee>] ? common_interrupt+0x6e/0x6e
[3689738.157797] <EOI>
[3689738.157798] ---[ end trace 8364fe1151c67413 ]---