Re: [Patch] bonding: fix netpoll in active-backup mode

From: Cong Wang
Date: Tue Mar 08 2011 - 03:27:11 EST


On 2011-03-08 12:15, Cong Wang wrote:
On 2011-03-08 02:50, Neil Horman wrote:
On Mon, Mar 07, 2011 at 10:11:50PM +0800, Amerigo Wang wrote:
netconsole doesn't work in active-backup mode, because we don't do anything
for nic failover in active-backup mode. This patch fixes the problem by:

1) making slave_enable_netpoll() and slave_disable_netpoll() callable in softirq
context, that is, moving the code after synchronize_rcu_bh() into a call_rcu_bh()
callback function and making kzalloc() use GFP_ATOMIC.

2) disabling netpoll on the old slave and enabling netpoll on the new slave.

Tested by bringing the current active slave down with ifdown and back up with
ifup several times; netconsole works well.

Signed-off-by: WANG Cong <amwang@xxxxxxxxxx>

I may be missing something, but this seems way over-complicated to me. I presume
the problem is that in active-backup mode a failover results in the new active
slave not having netpoll set up on it? If that's the case, why not just set up
netpoll on all slaves when ndo_netpoll_setup is called on the bonding interface?
I don't see anything immediately catastrophic that would happen as a result.
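
The suggestion above (enable netpoll on every slave when the bond's setup hook
runs) can be illustrated with a small userspace model. This is only a sketch
under stated assumptions: the sketch_* names and the setup_fails flag are
hypothetical stand-ins, not the bonding driver's structures or functions, and
the rollback on failure is an assumption of the sketch rather than something
stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bonding slave; not the kernel's struct slave. */
struct sketch_slave {
    bool netpoll_enabled;
    bool setup_fails;      /* test knob: simulate a per-slave setup failure */
};

/* Models enabling netpoll on one slave; returns 0 on success, -1 on failure. */
static int sketch_slave_enable(struct sketch_slave *s)
{
    if (s->setup_fails)
        return -1;
    s->netpoll_enabled = true;
    return 0;
}

/*
 * Models the bond-level setup hook: enable netpoll on all slaves up front,
 * so a later failover has nothing to enable. On failure, undo the slaves
 * that were already enabled (an assumption of this sketch).
 */
static int sketch_bond_netpoll_setup(struct sketch_slave *slaves, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (sketch_slave_enable(&slaves[i]) != 0) {
            while (i-- > 0)
                slaves[i].netpoll_enabled = false;
            return -1;
        }
    }
    return 0;
}
```

With this shape the failover path never touches netpoll state at all; only
enslave and release would need to.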


But we still need to clean up netpoll on the failing slave, which still means
calling slave_disable_netpoll() in the monitor code, so I see no big difference
from the solution I took.


And then you wouldn't have to worry about disabling/enabling anything on a
failover (or during a panic, for that matter). As for the RCU bits, why are
they needed? One would presume that we wouldn't (or at least shouldn't) be able
to tear down our netpoll setup until all the pending frames for that netpoll
client have been transmitted. If we're not blocking on that, RCU isn't really
going to help. It seems like the proper fix is to take a reference to the
appropriate npinfo struct in netpoll_send_skb, and drop it from the skb's
destructor or some such.
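
The reference-counting idea above can be sketched as a userspace model: the
send path takes a reference on the npinfo, the skb destructor drops it, and
teardown only drops the setup reference, so the structure is released once the
last in-flight frame is gone. All model_* names are hypothetical, the counter
is a plain int rather than an atomic, and "freed" is a flag instead of a real
free, so this shows the lifetime rule only and is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct netpoll_info / sk_buff. */
struct model_npinfo {
    int refcnt;   /* plain int here; a real implementation would be atomic */
    bool freed;   /* set instead of actually freeing, so tests can observe it */
};

struct model_skb {
    struct model_npinfo *npinfo;
    void (*destructor)(struct model_skb *skb);
};

/* Drop one reference; the last put releases the structure. */
static void model_npinfo_put(struct model_npinfo *np)
{
    assert(np->refcnt > 0);
    if (--np->refcnt == 0)
        np->freed = true;
}

/* Runs when the frame has left the device: drop the frame's reference. */
static void model_skb_destructor(struct model_skb *skb)
{
    model_npinfo_put(skb->npinfo);
    skb->npinfo = NULL;
}

/* Send path: pin the npinfo for as long as this frame is in flight. */
static void model_send_skb(struct model_npinfo *np, struct model_skb *skb)
{
    np->refcnt++;
    skb->npinfo = np;
    skb->destructor = model_skb_destructor;
}

/* Models transmit completion, which invokes the skb destructor. */
static void model_tx_complete(struct model_skb *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
}
```

In this model, teardown is just model_npinfo_put() on the setup reference; if
a frame is still queued, the npinfo survives until model_tx_complete() runs.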

I saw a "scheduling while atomic" warning without touching the RCU bits.


Hmm, I was wrong; this warning is misleading. I think the root cause is that
I call slave_disable_netpoll() with write_lock_bh() held...
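
One common way to avoid sleeping under a lock is to only swap pointers while
the lock is held and do the blocking teardown after it is released. The
userspace model below sketches that ordering; it is not necessarily what the
updated patch does. The model_* names are hypothetical, the "lock" is a plain
flag, and sleeps_in_atomic merely counts how often the may-sleep path ran with
the flag set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the lock is a flag, not a real rwlock. */
struct model_lock { bool held; };
struct model_nic { bool netpoll_enabled; };
struct model_bonding {
    struct model_lock lock;
    struct model_nic *active;
};

/* Counts calls to the may-sleep path made while the lock flag was set. */
static int sleeps_in_atomic;

/* Models a disable path that may block (e.g. waits for a grace period). */
static void model_may_sleep_disable(struct model_bonding *b,
                                    struct model_nic *nic)
{
    if (b->lock.held)
        sleeps_in_atomic++;   /* would be "scheduling while atomic" */
    nic->netpoll_enabled = false;
}

/* Failover: change the active slave under the lock, tear down after it. */
static void model_failover(struct model_bonding *b, struct model_nic *next)
{
    struct model_nic *old;

    b->lock.held = true;      /* models write_lock_bh() */
    old = b->active;
    b->active = next;
    b->lock.held = false;     /* models write_unlock_bh() */

    if (old && old != next)
        model_may_sleep_disable(b, old);   /* lock no longer held */
    if (next)
        next->netpoll_enabled = true;
}
```

The trade-off is a short window in which the old slave is no longer active but
still has netpoll enabled, which a real fix would have to account for.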

Will update the patch soon...