Detailed report on SMB-build lockups [seems that it is locking problem in networking code] (2.4.0-test2-ac2 and later)

From: Alexander Demenshin (aldem-nf@aldem.net)
Date: Tue Jul 11 2000 - 05:50:32 EST


Hello folks,

        So here it is... I spent nearly 20 hours non-stop to find it out,
        to test under different circumstances, and finally found
        _where_ it is, but not _why_ (unfortunately)...
        
        Conditions to reproduce:
        
                - Very heavy traffic load (> 8000 pkts/sec);
                - Netfilter (built as modules or statically)
                - ip_queue and QUEUE target(s) in effect
                  (it is the only essential part of Netfilter for this case)
                - User-space module which ACCEPT all packets queued;
                - Traffic generator used on _local_ interface:
                
> A lot of fragmented packets:
                        
                                ifconfig lo mtu 256
                                ping -f -s 8192 127.0.0.1
                                
> A lot of TCP traffic (connect/transfer/disconnect);
> MTU does not matter.
                        
        In my tests I used the following rules for iptables:
        
                iptables -t mangle -A PREROUTING -j QUEUE
                iptables -t mangle -A OUTPUT -j QUEUE
                
        I assume there are no other rules; but the problem occurs _only_
        when QUEUE target is in effect - other rules does not matter as long
        as there is no QUEUE targets or if packets are not accepted in userspace.
        In case if I use table 'filter' it also occurs (so nothing magical
        in 'mangle' table).
        
        So, once rules above are in effect, userspace module is running, and after
        certain period of time running traffic generator system lockup occurs
        (in my case - after processing of ca. 300K packets; but it depends -
        be patient :).
        
        No OOPs, no other kernel messages, _nothing_ except SysRq is active.
        
        Examining of code under EIP shows, that lockup occurs at:
        
                - In case of TCP traffic:
                
                        src/net/ipv4/tcp_timer.c:690
                        
--- src/net/ipv4/tcp_timer.c:690 tcp_synack_timer() ---
                                /* Drop this request */
                                write_lock(&tp->syn_wait_lock); /* <<< AT THIS PLACE */
                                *reqp = req->dl_next;
                                write_unlock(&tp->syn_wait_lock);

--- CUT ---

                - In case of ICMP (fragmented) traffic:
                
--- src/net/ipv4/ip_fragment:202 ip_expire ---
        spin_lock(&ipfrag_lock); /* <<< AT THIS PLACE */
        if(!qp->fragments)
        {
#ifdef IP_EXPIRE_DEBUG
                printk("warning: possible ip-expire attack\n");
#endif
                goto out;
        }

--- CUT ---

        Again, problem occurs _only_ on SMB build - and it does not matter
        is it running SMP box or not (occurs on non-SMB and 2CPU box).
        
        During the tests there are no other network or disk activities
        except of sleeping regular daemons like getty.
        
        Problem still persist on 2.4.0-test3-pre8.
        
        I've no idea where to look now, and even no idea is it related to
        netfilter itself (or ip_queue)... Logically, it has to relate in some
        way to ip_queue (at least nf_reinject()) - because it occurs only when
        QUEUE target is active and packets are reinjected. It _does not_ occur
        when there is no QUEUE target or if packets are _not_ accepted.
        
        But (there is always but) - nf_reiject() and ip_queue itself produce
        a lot of calls to related networking parts; including netlink code.
        
        Hardware used for tests:
        
                - Single-CPU ASUS board:
                
                        PII-350 Deschutes (512K L2 Cache)
                        256M RAM
                        
                - Double-CPU ASUS board:
                
                        PIII-550 Coppermine (256L L2 Cache)
                        256M RAM
                        
        (I am not sure does it matter or not - may be only CPU freq to
        generate necessary amount of traffic).
        
        Software:
        
                - kernel 2.4.0-test2-ac2 (and later 2.4.0-test3-pre8)
                - iptables 1.1.0
                - ab ("apache bench" from apache distribution; used
                  as traffic generator: ab -n 1000000 http://localhost/)
                - any fast enough web server (to run "ab" at)
                - ping -f (for ICMP fragmented traffic).
        
        I suspect that problem will never occur under normal circumstances,
        but anyway it _does exist_, so it _may_ occur (sooner or later).
        
        Here is the link to my demo userspace ip_queue handling program (which
        can be used in test if someone will try to hunt this bug):
        
                http://aldem.net/netfilter/
                
        That's all. Any ideas? Suggestions? Comments? (Flames?) :)
        
        (I am only on netfilter list so please CC me in case if any additional
        information is necessary).

        Good luck!
        
/Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Jul 15 2000 - 21:00:12 EST