Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Eric Dumazet
Date: Wed Jun 08 2011 - 17:22:43 EST
On Thursday, 09 June 2011 at 01:02 +0800, Brad Campbell wrote:
> On 08/06/11 11:59, Eric Dumazet wrote:
>
> > Well, a bisection definitely should help, but needs a lot of time in
> > your case.
>
> Yes. compile, test, crash, walk out to the other building to press
> reset, lather, rinse, repeat.
>
> I need a reset button on the end of a 50M wire, or a hardware watchdog!
>
> Actually it's not so bad. If I turn off slub debugging the kernel panics
> and reboots itself.
>
> This.. :
> [ 2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
> [ 2.913066] netconsole: device eth0 not up yet, forcing it
> [ 3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
> [ 3.660118] Switching to clocksource tsc
> [ 63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
> [ 63.223513] r8169 0000:03:00.0: eth0: link down
> [ 63.223556] r8169 0000:03:00.0: eth0: link down
>
> ..is slowing down reboots considerably. 3.0-rc does _not_ like some
> timing hardware in my machine. Having said that, at least it does not
> randomly panic on SCSI like 2.6.39 does.
>
> Ok, I've ruled out TCPMSS. Found out where it was being set and neutered
> it. I've replicated it with only the single DNAT rule.
>
>
> > Could you try following patch, because this is the 'usual suspect' I had
> > yesterday :
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 46cbd28..9f548f9 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
> > }
> >
> > +#if 0
> > if (fastpath &&
> > size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
> > memmove(skb->head + size, skb_shinfo(skb),
> > @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > off = nhead;
> > goto adjust_others;
> > }
> > -
> > +#endif
> > data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> > if (!data)
> > goto nodata;
> >
> >
> >
>
> Nope.. that's not it. <sigh> That might have changed the characteristic
> of the fault slightly, but unfortunately I got caught with a couple of
> fsck's, so I only got to test it 3 times tonight.
>
> It's unfortunate that this is a production system, so I can only take it
> down between about 9pm and 1am. That would normally be pretty
> productive, except that an fsck of a 14TB ext4 can take 30 minutes if it
> panics at the wrong time.
>
> I'm out of time tonight, but I'll have a crack at some bisection
> tomorrow night. Now I just have to go back far enough that it works, and
> be near enough not to have to futz around with /proc /sys or drivers.
>
> I really, really, really appreciate you guys helping me with this. It
> has been driving me absolutely bonkers. If I'm ever in the same town as
> any of you, dinner and drinks are on me.
Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so not
appropriate for production :(