Re: BNX2: Kernel crashes with 2.6.31 and 188.8.131.52
From: Benjamin Li
Date: Mon Mar 01 2010 - 20:26:23 EST
On Tue, 2010-02-23 at 04:15 -0800, Bruno Prémont wrote:
> Hi Benjamin,
> On Fri, 19 February 2010 "Benjamin Li" <benli@xxxxxxxxxxxx> wrote:
> > From your logs it looks like the device came up using MSI, but the
> > MSI-X poll routine was being called:
> > [ 9.836673] bnx2: eth0: using MSI
> > ...
> > [ 134.643459] [<ffffffffa004019e>] bnx2_poll_msix+0x3e/0xd0 [bnx2]
> > [ 134.643465] [<ffffffff8135bcd1>] netpoll_poll+0xe1/0x3c0
> > which is incorrect. If we are in MSI mode, the bnx2_poll() routine
> > should be used.
> > I think what is going on here is that during driver initialization
> > the current bnx2 driver adds NAPI structures for all possible hardware
> > vectors (BNX2_MAX_MSIX_VEC=9) to the NAPI list in the net_device
> > structure, regardless of whether they are used (see
> > drivers/net/bnx2.c:bnx2_init_napi()). This can cause uninitialized
> > NAPI structures to be placed on the napi_list.
> > Because this device is in MSI mode, only 1 vector is initialized.
> > Now, the problem is triggered when net/core/netpoll.c:poll_napi() is
> > called. This is because this routine will run through the entire
> > napi_list calling all the poll routines. In your particular case, it
> > is calling the poll routine on an uninitialized vector causing the
> > kernel panic.
> > Please try the patch below to see if it solves your problem. Note,
> > this has only been compile-tested and tested against basic traffic
> > runs. Unfortunately, I could not reproduce the kernel panic with the
> > instructions below to verify the patch.
> > Thanks again for all your help in helping us track this down.
> I applied the patch today and tried to reproduce with my showcases.
> Seems that it's harder to trigger now but I still end up being able to
> crash the box. Don't know if it's the same cause or not (could also
> be the tcp-retransmit ghost)...
> This time I had to run a few parallel scp's (8Mb/s each) to the box and
> 'echo t > /proc/sysrq-trigger' multiple times via ssh session for it to
> happen. (It didn't trigger with my netbomb, though I will try some more
> and see.)
> I don't know if it's the same reason or not (hopefully something
> reached disk, as serial console is dead and pings are not
> answered anymore).
> It's probably some printk/bug/warn that triggers in network stack and
> deadlocks with netconsole.
Thanks for trying the patch. I still haven't been able to reproduce
what you are seeing here. I am able to run scp and
'echo t > /proc/sysrq-trigger' multiple times. I was wondering if you
had any success reproducing the problem with a stack trace?
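
For reference, the failure mode described in the quoted analysis can be
sketched in user space. This is only an illustrative model, not the driver
code or the patch: struct vec, struct dev, register_vectors(), poll_all()
and count_bad_polls() are hypothetical names, and MAX_VEC simply mirrors
BNX2_MAX_MSIX_VEC=9. It assumes that only the vectors actually in use have
been initialized (one in MSI mode) and that a netpoll-style walker calls
every poll routine on the list.

```c
#include <stddef.h>
#include <string.h>

#define MAX_VEC 9 /* mirrors BNX2_MAX_MSIX_VEC=9 from the analysis above */

struct vec;
typedef int (*poll_fn)(struct vec *v);

/* Hypothetical stand-in for one per-vector NAPI context. */
struct vec {
    poll_fn poll;     /* poll routine registered for this vector */
    int initialized;  /* nonzero once the vector's state is set up */
    struct vec *next; /* link in the device-wide poll list */
};

/* Hypothetical stand-in for the device: all vectors plus the poll list. */
struct dev {
    struct vec vecs[MAX_VEC];
    struct vec *poll_list;
    int used_vecs;    /* 1 in MSI mode */
};

/* Both routines report -1 where the real code would touch bad state. */
static int poll_single(struct vec *v) { return v->initialized ? 0 : -1; }
static int poll_msix(struct vec *v)   { return v->initialized ? 0 : -1; }

/* Put the first `count` vectors on the poll list, vector 0 first. */
static void register_vectors(struct dev *d, int count)
{
    d->poll_list = NULL;
    for (int i = count - 1; i >= 0; i--) {
        d->vecs[i].poll = (i == 0) ? poll_single : poll_msix;
        d->vecs[i].next = d->poll_list;
        d->poll_list = &d->vecs[i];
    }
}

/* Walk the whole list and call every poll routine, as a netpoll-style
 * walker would. Returns how many calls hit an uninitialized vector. */
static int poll_all(const struct dev *d)
{
    int bad = 0;
    for (struct vec *v = d->poll_list; v != NULL; v = v->next)
        if (v->poll(v) != 0)
            bad++;
    return bad;
}

/* Bring a device up in MSI mode (one initialized vector), register either
 * all vectors or only the used ones, then poll the whole list once. */
int count_bad_polls(int register_all)
{
    struct dev d;
    memset(&d, 0, sizeof(d));
    d.used_vecs = 1;
    for (int i = 0; i < d.used_vecs; i++)
        d.vecs[i].initialized = 1;
    register_vectors(&d, register_all ? MAX_VEC : d.used_vecs);
    return poll_all(&d);
}
```

In this model, count_bad_polls(1) reports 8 polls landing on uninitialized
vectors, while count_bad_polls(0), which registers only the vectors in use,
reports none. Whether the remaining crash has the same cause is a separate
question that this sketch does not address.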