Ingo Molnar wrote:
* Ingo Molnar <mingo@xxxxxxx> wrote:
IPS_CONFIRMED_BIT is set under nf_conntrack_lock (in __nf_conntrack_confirm()), we probably want to add a synchronisation under ct->lock as well, or __nf_ct_refresh_acct() could set ct->timeout.expires to extra_jiffies, while a different cpu could confirm the conntrack.
Following patch as RFC
A quick test suggests that it seems to work here - thanks Eric!
a test-box still triggered this crash overnight:
[ 252.433471] ------------[ cut here ]------------
[ 252.436031] WARNING: at lib/list_debug.c:30 __list_add+0x95/0xa0()
[ 252.436031] Hardware name: System Product Name
[ 252.436031] list_add corruption. prev->next should be next (ffff88003fa1d460), but was ffffffff82e560a0. (prev=ffff880003b458c0).
[ 252.436031] Pid: 7348, comm: ssh Tainted: G W 2.6.30-tip #54604
[ 252.436031] Call Trace:
[ 252.436031] [<ffffffff8149eda5>] ? __list_add+0x95/0xa0
[ 252.436031] [<ffffffff8105c79b>] warn_slowpath_common+0x7b/0xd0
[ 252.436031] [<ffffffff8105c851>] warn_slowpath_fmt+0x41/0x50
[ 252.436031] [<ffffffff8149eda5>] __list_add+0x95/0xa0
[ 252.436031] [<ffffffff8106937e>] internal_add_timer+0x9e/0xf0
[ 252.436031] [<ffffffff8106a5ef>] mod_timer+0x10f/0x160
[ 252.436031] [<ffffffff8106a658>] add_timer+0x18/0x20
[ 252.436031] [<ffffffff81f6e42a>] __nf_conntrack_confirm+0x1da/0x3c0
[ 252.436031] [<ffffffff81fdb8dd>] ipv4_confirm+0xfd/0x160
[ 252.436031] [<ffffffff81f6a130>] nf_iterate+0x70/0xd0
[ 252.436031] [<ffffffff81f99180>] ? ip_finish_output+0x0/0x380
[ 252.436031] [<ffffffff81f6a4c4>] nf_hook_slow+0xe4/0x160
[ 252.436031] [<ffffffff81f99180>] ? ip_finish_output+0x0/0x380
[ 252.436031] [<ffffffff81f995f5>] ip_output+0xf5/0x110
[ 252.436031] [<ffffffff81f96b05>] ip_local_out+0x25/0x40
[ 252.436031] [<ffffffff81f97434>] ip_queue_xmit+0x224/0x420
[ 252.436031] [<ffffffff81111118>] ? __kmalloc_node_track_caller+0xd8/0x1f0
[ 252.436031] [<ffffffff8108df19>] ? trace_hardirqs_on_caller+0x29/0x1f0
[ 252.436031] [<ffffffff81fae0dd>] tcp_transmit_skb+0x50d/0x7e0
[ 252.436031] [<ffffffff81faf547>] tcp_connect+0x3c7/0x500
[ 252.436031] [<ffffffff81fb4693>] tcp_v4_connect+0x433/0x520
[ 252.436031] [<ffffffff81fc446f>] inet_stream_connect+0x22f/0x2d0
[ 252.436031] [<ffffffff81118719>] ? fget_light+0x19/0x110
[ 252.436031] [<ffffffff81f294b8>] sys_connect+0xb8/0xd0
[ 252.436031] [<ffffffff8100ccf9>] ? retint_swapgs+0x13/0x1b
[ 252.436031] [<ffffffff8108df19>] ? trace_hardirqs_on_caller+0x29/0x1f0
[ 252.436031] [<ffffffff8217a49f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 252.436031] [<ffffffff8100c252>] system_call_fastpath+0x16/0x1b
[ 252.436031] ---[ end trace a7919e7f17c0a73d ]---
With your patch (repeated below) applied. Is Patrick's alternative patch supposed to fix something that yours does not?
Hmm, not really. Patrick's patch should fix the same problem, but without the
extra locking that mine adds.
This new stack trace is somewhat different: the corruption is detected in the
add_timer() call in __nf_conntrack_confirm().
So there is another problem. I have CCed Pablo Neira Ayuso, who recently made
changes to netfilter and its timeout logic.
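
For readers following the thread, here is a rough userspace sketch of the refresh/confirm race described in the quoted text at the top of this message. It is not the RFC patch and not kernel code: struct model_ct and the model_* helpers are hypothetical names, the spinlock is a C11 atomic_flag standing in for ct->lock, and the timer wheel is reduced to a single "expires" field. It assumes the refresh path stores a relative timeout while the entry is unconfirmed and that confirmation converts it to an absolute one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is NOT the kernel's struct nf_conn. */
struct model_ct {
    atomic_flag lock;        /* stands in for ct->lock */
    bool confirmed;          /* stands in for IPS_CONFIRMED_BIT */
    unsigned long expires;   /* stands in for ct->timeout.expires */
};

static void model_lock(struct model_ct *ct)
{
    /* Spin until the flag was previously clear, i.e. we own the lock. */
    while (atomic_flag_test_and_set_explicit(&ct->lock, memory_order_acquire))
        ;
}

static void model_unlock(struct model_ct *ct)
{
    atomic_flag_clear_explicit(&ct->lock, memory_order_release);
}

static void model_init(struct model_ct *ct, unsigned long initial_timeout)
{
    atomic_flag_clear(&ct->lock);
    ct->confirmed = false;
    ct->expires = initial_timeout;  /* relative until confirmed */
}

/* Models the refresh path: the confirmed test and the store to
 * expires happen in one critical section. */
static void model_refresh(struct model_ct *ct, unsigned long now,
                          unsigned long extra_jiffies)
{
    model_lock(ct);
    if (!ct->confirmed)
        ct->expires = extra_jiffies;        /* relative, timer not queued */
    else
        ct->expires = now + extra_jiffies;  /* absolute, models re-arming */
    model_unlock(ct);
}

/* Models confirmation: relative -> absolute, then mark confirmed,
 * under the same lock as the refresh path. */
static void model_confirm(struct model_ct *ct, unsigned long now)
{
    model_lock(ct);
    assert(!ct->confirmed);  /* an entry is confirmed only once */
    ct->expires += now;
    ct->confirmed = true;
    model_unlock(ct);
}
```

Because both paths take the same lock, refresh can no longer observe "unconfirmed" and then overwrite expires after another CPU has already queued the timer. In this model, either ordering of model_refresh() and model_confirm() at the same "now" ends with expires equal to now + extra_jiffies. The sketch says nothing about the second problem seen in the add_timer() trace above.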