Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bondingdevices

From: Cong Wang
Date: Thu May 27 2010 - 22:43:37 EST


On 05/28/10 02:05, Flavio Leitner wrote:

Hi guys!

I finally could test this to see if an old problem reported on bugzilla[1] was
fixed now, but unfortunately it is still there.

The ticket is private I guess, but basically the problem happens when bonding
driver tries to print something after it had taken the write_lock (monitor
functions, enslave/de-enslave), so the printk() will pass through netpoll, then
on bonding again which no matter what mode you use, it will try to read_lock()
the lock again. The result is a deadlock and the entire system hangs.


This is true, I already fixed some similar issues.


I manage to get a fresh backtrace with mode 1, see below:


[ 93.167079] Call Trace:
[ 93.167079] [<ffffffff81034cf9>] warn_slowpath_common+0x77/0x8f
[ 93.167079] [<ffffffff81034d5e>] warn_slowpath_fmt+0x3c/0x3e
[ 93.167079] [<ffffffff81366aef>] ? _raw_read_trylock+0x11/0x4b
[ 93.167079] [<ffffffffa02a2c42>] ? bond_start_xmit+0x12b/0x401 [bonding]
-> read_lock fails
[ 93.167079] [<ffffffffa02a2c9f>] bond_start_xmit+0x188/0x401 [bonding]
[ 93.167079] [<ffffffff81055b37>] ? trace_hardirqs_off+0xd/0xf
[ 93.167079] [<ffffffff812dfdb9>] netpoll_send_skb+0xbd/0x1f3
[ 93.167079] [<ffffffff812e00ed>] netpoll_send_udp+0x1fe/0x20d
[ 93.167079] [<ffffffffa02c017c>] write_msg+0x89/0xcd [netconsole]
[ 93.167079] [<ffffffff81034e65>] __call_console_drivers+0x67/0x79
[ 93.167079] [<ffffffff81034ed0>] _call_console_drivers+0x59/0x5d
[ 93.167079] [<ffffffff810352d3>] release_console_sem+0x121/0x1d7
[ 93.167079] [<ffffffff8103590a>] vprintk+0x35d/0x393
[ 93.167079] [<ffffffff8103f947>] ? add_timer+0x17/0x19
[ 93.167079] [<ffffffff81046ddf>] ? queue_delayed_work_on+0xa2/0xa9
[ 93.167079] [<ffffffff81363bb8>] printk+0x3c/0x44
[ 93.167079] [<ffffffffa02a3b17>] bond_select_active_slave+0x105/0x109 [bonding]
-> write_locked
[ 93.167079] [<ffffffffa02a4798>] bond_mii_monitor+0x479/0x4ed [bonding]
[ 93.167079] [<ffffffff81046009>] worker_thread+0x1ef/0x2e2

In this case, the message should be
"bonding: bond0: making interface eth0 the new active one"


Hmm, you triggered a warning here, let me check the source code
and try to reproduce it here.


I did the following patch to discard the packet if it was IN_NETPOLL
and the read_lock() fails, so I could go ahead testing it:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e12462..a3b8bad 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4258,8 +4258,19 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
struct bonding *bond = netdev_priv(bond_dev);
int res = 1;

- read_lock(&bond->lock);
- read_lock(&bond->curr_slave_lock);
+ if (read_trylock(&bond->lock) == 0&&
+ (bond_dev->flags& IFF_IN_NETPOLL)) {
+ dev_kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
+
+ if (read_trylock(&bond->curr_slave_lock) == 0&&
+ (bond_dev->flags& IFF_IN_NETPOLL)) {
+ read_unlock(&bond->lock);
+ dev_kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
+

if (!BOND_IS_OK(bond))
goto out;



This looks like a workaround, not a fix. :)


and I found another problem. The function netpoll_send_skb() checks
if the npinfo's queue length is zero and if it's not, it will queue
the packet to make sure it's in order and then schedule the thread
to run. Later, the thread wakes up running queue_process() which disables
interrupts before calling ndo_start_xmit(). However, dev_queue_xmit()
uses rcu_*_bh() and before return, it will enable the interrupts again,
spitting this:

------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x3c/0x86()
Hardware name: Precision WorkStation 490
Modules linked in: netconsole bonding sunrpc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6]
Pid: 17, comm: events/2 Not tainted 2.6.34-04700-gd938a70 #21
Call Trace:
[<ffffffff810381d6>] warn_slowpath_common+0x77/0x8f
[<ffffffff810381fd>] warn_slowpath_null+0xf/0x11
[<ffffffff8103d691>] local_bh_enable+0x3c/0x86
[<ffffffff812e4d85>] dev_queue_xmit+0x462/0x493
[<ffffffffa018805f>] bond_dev_queue_xmit+0x1bd/0x1e3 [bonding]
[<ffffffffa01881dd>] bond_start_xmit+0x158/0x37b [bonding]
-> interrupts disabled
[<ffffffff812f3fca>] queue_process+0x9d/0xf9
[<ffffffff8104d022>] worker_thread+0x19d/0x224
[<ffffffff812f3f2d>] ? queue_process+0x0/0xf9
[<ffffffff81050819>] ? autoremove_wake_function+0x0/0x34
[<ffffffff8104ce85>] ? worker_thread+0x0/0x224
[<ffffffff8105040b>] kthread+0x7a/0x82
[<ffffffff810036d4>] kernel_thread_helper+0x4/0x10
[<ffffffff81050391>] ? kthread+0x0/0x82
[<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
---[ end trace 74e3904503fdb632 ]---

kernel/softirq.c:
141 static inline void _local_bh_enable_ip(unsigned long ip)
142 {
143 WARN_ON_ONCE(in_irq() || irqs_disabled());
144 #ifdef CONFIG_TRACE_IRQFLAGS
145 local_irq_disable();
146 #endif
147 /*
148 * Are softirqs going to be turned on now:
149 */



I am wondering if this was caused by the previous issue.


The git is updated up to:
d938a70 be2net: increase POST timeout for EEH recovery

Two slave interfaces, bonding mode 1, netconsole over bond0.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=248374#c5

How did you reproduce this?
I will check that BZ to see if I can find how to reproduce this.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/