RE: [BUG] 4.11.0-rc1 panic on shutdown X61s

From: Brown, Aaron F
Date: Mon Mar 13 2017 - 21:20:40 EST

> From: Bjørn Mork [mailto:bjorn@xxxxxxx]
> Sent: Monday, March 13, 2017 9:46 AM
> To: Borislav Petkov <bp@xxxxxxxxx>
> Cc: Andy Shevchenko <andy.shevchenko@xxxxxxxxx>; lkml@xxxxxxxxxxx;
> linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>; vcaputo@xxxxxxxxxxx;
> linux-pci@xxxxxxxxxxxxxxx; intel-wired-lan@xxxxxxxxxxxxxxxx;
> khalidm <khalidm@xxxxxxxxx>; David Singleton <davsingl@xxxxxxxxx>;
> Brown, Aaron F <aaron.f.brown@xxxxxxxxx>;
> Kirsher, Jeffrey T <jeffrey.t.kirsher@xxxxxxxxx>
> Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s
> Borislav Petkov <bp@xxxxxxxxx> writes:
> > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> >
> >> The only change that IMHO matters happened between v4.10 and
> >> v4.11-rc1 is this:
> >>
> >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
> >>                 /* Quiesce the device without resetting the hardware */
> >>                 e1000e_down(adapter, false);
> >>                 e1000_free_irq(adapter);
> >> +               e1000e_reset_interrupt_capability(adapter);
> >>         }
> >> -       e1000e_reset_interrupt_capability(adapter);
> >>
> >> So, it apparently misses something for the other case, like
> >> pci_disable_msi() call or so.
> >
> > Well, lemme add the people from
> >
> > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> >
> > to CC then. :-)
>
> Already did that a week ago:
> Haven't heard anything back yet. Wondering if they are waiting for
> someone else to submit the pretty obvious revert? Don't understand why
> that should take more than a minute to figure out. It's not like they
> are testing these changes anyway...

Believe it or not, we actually do test these changes. I tested this one myself and did not see the results that you and the other people reporting this trace did. I made it back to the lab today and have spent a good part of the day trying to reproduce this bug, without success. Freeze / resume works for me on every system I have tried, which includes a sampling of all the current parts and many older ones.

Given that there are several other reports, it is obviously a real issue, and I would like to be able to reproduce it in case another patch for the problem this one attempts to fix comes back in another form. So I want to know what is different between the systems that hit this and my bank of systems that don't.

What exact part (or parts) trigger this (lspci | grep -i eth)? Could it be a difference in .config files? The trace says it is falling back to legacy interrupts; does the system continue to work, and does the network continue to function in that mode? In case it's related to user space, what is the base distro? Any other information you think could help me reproduce the issue would be appreciated.
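
For anyone who wants to send all of that in one go, something along these lines might work. It is only a rough sketch: the function name collect_info, the interface-name patterns used to filter /proc/interrupts, and the choice of config options to show are assumptions made for illustration, and every step is skipped when the tool or file is not present.

```shell
#!/bin/sh
# Sketch: gather details that may help reproduce the shutdown trace.
# collect_info is a hypothetical helper name, not part of any tool.

collect_info() {
    echo "== kernel =="
    uname -r

    echo "== ethernet devices =="
    if command -v lspci >/dev/null 2>&1; then
        # -nn adds vendor/device IDs, which identify the exact part
        lspci -nn | grep -i eth || echo "(no ethernet devices listed)"
    else
        echo "(lspci not installed)"
    fi

    echo "== interrupts =="
    if [ -r /proc/interrupts ]; then
        # Assumed interface naming: ethN, enpXsY or enoN
        grep -i -E 'eth|enp|eno' /proc/interrupts \
            || echo "(no matching interrupt lines)"
    else
        echo "(/proc/interrupts not readable)"
    fi

    echo "== kernel config =="
    cfg="/boot/config-$(uname -r)"
    if [ -r "$cfg" ]; then
        grep -E '^CONFIG_(E1000E|PCI_MSI)=' "$cfg" \
            || echo "(options not set in $cfg)"
    else
        echo "($cfg not readable)"
    fi

    echo "== distro =="
    if [ -r /etc/os-release ]; then
        grep '^PRETTY_NAME=' /etc/os-release \
            || echo "(PRETTY_NAME not set)"
    else
        echo "(/etc/os-release not readable)"
    fi
}

collect_info
```

Running it once right after boot and once after a freeze / resume cycle, and comparing the interrupts section, should show whether the device ended up on legacy interrupts.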


> Bjørn