Re: [PATCH] netfilter: nf_conntrack: fix RCU race innf_conntrack_find_get (v3)

From: Andrew Vagin
Date: Tue Jan 14 2014 - 05:52:09 EST


On Sun, Jan 12, 2014 at 12:21:14PM -0800, Eric Dumazet wrote:
> On Sun, 2014-01-12 at 21:50 +0400, Andrey Vagin wrote:
> > Lets look at destroy_conntrack:
> >
> > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > ...
> > nf_conntrack_free(ct)
> > kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> >
> > net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
> >
> > The hash is protected by rcu, so readers look up conntracks without
> > locks.
> > A conntrack is removed from the hash, but in this moment a few readers
> > still can use the conntrack. Then this conntrack is released and another
> > thread creates conntrack with the same address and the equal tuple.
> > After this a reader starts to validate the conntrack:
> > * It's not dying, because a new conntrack was created
> > * nf_ct_tuple_equal() returns true.
> ...
>
>
> > v2: move nf_ct_is_confirmed into the unlikely() annotation
> > v3: Eric suggested to fix refcnt, so that it becomes zero before adding
> > in a hash, but we can't find a way how to do that. Another way is to
> > interpret the confirm bit as part of a search key and check it in
> > ____nf_conntrack_find() too.
> >
> > Cc: Eric Dumazet <eric.dumazet@xxxxxxxxx>
> > Cc: Florian Westphal <fw@xxxxxxxxx>
> > Cc: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> > Cc: Patrick McHardy <kaber@xxxxxxxxx>
> > Cc: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx>
> > Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> > Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
> > Signed-off-by: Andrey Vagin <avagin@xxxxxxxxxx>
> > ---
>
> Acked-by: Eric Dumazet <edumazet@xxxxxxxxxx>
>
> Thanks Andrey !
>

Eh, looks like this path is incomplete too:(

I think we can't set a reference counter for objects which is allocated
from a SLAB_DESTROY_BY_RCU cache. Look at the following backtrace.

cpu1 cpu2
ct = ____nf_conntrack_find()
destroy_conntrack
atomic_inc_not_zero(ct)
__nf_conntrack_alloc
atomic_set(&ct->ct_general.use, 1);
if (!nf_ct_key_equal(h, tuple, zone))
nf_ct_put(ct);
destroy_conntrack(ct) !!!!
/* continues to use the conntrack */


Did I miss something again?

I think __nf_conntrack_alloc must use atomic_inc instead of
atomic_set. And we must be sure, that the first object from a new page is
zeroized.

I am talking about this, because after this patch a bug was triggered from
another place:

<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
<4>[67096.759371] invalid opcode: 0000 [#1] SMP
<4>[67096.759385] last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version
<4>[67096.759414] CPU 2
<4>[67096.759422] Modules linked in: xt_comment sch_sfq cls_fw sch_htb pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs xt_string xt_hashlimit xt_recent xt_length xt_hl xt_tcpmss xt_TCPMSS xt_multiport xt_limit xt_dscp vzevent coretemp cpufreq_ondemand acpi_cpufreq freq_table mperf 8021q garp stp llc ipt_REJECT iptable_filter iptable_mangle xt_NOTRACK iptable_raw iptable_nat ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state ip6table_filter ip6table_raw xt_MARK ip6table_mangle ip6_tables ext4 jbd2 tun ip_gre ipip vzethdev vznetdev vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipv6 vzcpt nf_conntrack vzdquota vzmon vziolimit vzdev tunnel4 nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc tpm_tis tpm tpm_bios microcode serio_raw i2c_i801 sg iTCO_wdt iTCO_vendor_support e1000e ext3 jbd mbcache raid1 sd_mod crc_t10dif ata_piix ahci pata_acpi ata_generic i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>[67096.759801]
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760032] RSP: 0000:ffff88001ae378b8 EFLAGS: 00010246
<4>[67096.760075] RAX: 0000000000000000 RBX: ffff8801a57ac928 RCX: 0000000000065000
<4>[67096.760123] RDX: 000000000000f603 RSI: 0000000000000006 RDI: ffff8801a57ac928
<4>[67096.760174] RBP: ffff88001ae378d8 R08: 0000000000000002 R09: ffff8802373b06e0
<4>[67096.760221] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88023928c080
<4>[67096.760255] R13: ffff880237e8c000 R14: 0000000000000002 R15: 0000000000000002
<4>[67096.760255] FS: 0000000000000000(0000) GS:ffff880028300000(0063) knlGS:00000000b63afbb0
<4>[67096.760255] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
<4>[67096.760255] CR2: 00000000b74f44c0 CR3: 00000000b89c6000 CR4: 00000000000007e0
<4>[67096.760255] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[67096.760255] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[67096.760255] Process atdd (pid: 498649, veid: 666, threadinfo ffff88001ae36000, task ffff88001deaa980)
<4>[67096.760255] Stack:
<4>[67096.760255] ffff88001ae378e8 ffff88001ae37988 ffff88023928c080 0000000000000003
<4>[67096.760255] <d> ffff88001ae378e8 ffffffff814844a7 ffff88001ae37908 ffffffffa03d9bb5
<4>[67096.760255] <d> ffff88012dcae580 ffff88023928c080 ffff88001ae379e8 ffffffffa03d9fb2
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[67096.760255] Code: 0b ab 0a e1 eb b7 f6 05 34 f8 00 e2 20 74 b7 80 3d f0 b0 00 00 00 74 ae 48 89 de 48 c7 c7 20 16 3e a0 31 c0 e8 05 ca 13 e1 eb 9b <0f> 0b eb fe f6 05 0b f8 00 e2 20 0f 84 db fe ff ff 80 3d eb b0
<1>[67096.760255] RIP [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] RSP <ffff88001ae378b8>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/