Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error

From: Maciej W. Rozycki
Date: Fri Jul 03 2009 - 17:29:29 EST


On Thu, 2 Jul 2009, Ingo Molnar wrote:

> > Well, that's just a fast track to become a veteran, isn't it? ;)
>
> No, that's just a fast track to quickly make it into the list of our
> Fallen Heroes :-/ The fast track to become a kernel veteran is to,
> if possible, not challenge a tank with a hand-grenade. But i
> digress.

What doesn't kill you will make you stronger, ;) but otherwise I digress
too.

> > That shouldn't be a problem if we were about to panic(). For a
> > more sophisticated attempt of recovery -- yes, that would have to
> > be addressed.
>
> We are only panic-ing if the sysctl is set. The diagnostics would be
> useful anyway. The proper approach would be to defer it a bit in the
> non-panic case and read it out from some friendlier context - such as
> the EDAC core.

Hmm, my concern is that in the case of a PCI SERR the system may not
necessarily be in a recoverable state. For example, if a master abort
happened due to a timeout (which is outside the PCI spec I'm told, but the
only way to avoid holding the bus indefinitely) and the target finally
responded, then it may have corrupted a subsequent transaction. My point
is thus that any diagnostic output should be produced as soon as possible,
using as few system resources as absolutely necessary. It only needs to be
enough to identify the device triggering the SERR, so that if an error
is fatal and recurs, the likely offender can be determined.
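
As a rough illustration of how little is needed to identify the offender,
here is a userspace sketch that scans a table of status register values for
the "Signaled System Error" bit. The bit value is the one from the PCI spec
(PCI_STATUS_SIG_SYSTEM_ERROR in <linux/pci_regs.h>); struct fake_pci_dev and
find_serr_source() are hypothetical names made up for this example, and a
real handler would read the status word from config space instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Status register bit 14: the device has asserted SERR#. */
#define PCI_STATUS_SIG_SYSTEM_ERROR 0x4000

/* Hypothetical stand-in for a device's config header; not a kernel type. */
struct fake_pci_dev {
	uint8_t bus;
	uint8_t devfn;
	uint16_t command;
	uint16_t status;
};

/*
 * Return the index of the first device whose status register reports a
 * signalled system error, or -1 if none does.  Only register values are
 * examined: no allocation, no locking, no sleeping.
 */
static int find_serr_source(const struct fake_pci_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].status & PCI_STATUS_SIG_SYSTEM_ERROR)
			return (int)i;
	}
	return -1;
}
```

The returned index is all the early diagnostic needs in order to print the
bus/devfn of the suspect; anything more elaborate can wait.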

Deferring such initial diagnostics to a softirq or suchlike does not sound
like a terribly good idea to me. I think this is also the right place to
disable the device's master access to the bus (and possibly target address
space decoders too -- the device may have started misdecoding and
interfering with transactions meant to involve other devices) -- till the
recovery procedure has been completed.
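
A minimal sketch of the quiescing step, assuming it amounts to clearing
bits in the device's PCI command register. The bit values are the standard
ones (PCI_COMMAND_IO, PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER in
<linux/pci_regs.h>); quiesce_command() is a hypothetical helper that only
computes the new register value. In the kernel the word would be read and
written back with pci_read_config_word()/pci_write_config_word(), and
whether those are safe to call from the NMI path is a separate question
this sketch does not answer.

```c
#include <stdint.h>

/* Command register bits as defined by the PCI spec. */
#define PCI_COMMAND_IO     0x0001 /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY 0x0002 /* respond to memory space accesses */
#define PCI_COMMAND_MASTER 0x0004 /* may act as a bus master */

/*
 * Compute the command register value that takes a device off the bus.
 * Bus mastering is always disabled; the I/O and memory decoders are
 * disabled as well when disable_decoders is non-zero.  All other bits
 * are left untouched.
 */
static uint16_t quiesce_command(uint16_t command, int disable_decoders)
{
	command &= (uint16_t)~PCI_COMMAND_MASTER;
	if (disable_decoders)
		command &= (uint16_t)~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
	return command;
}
```

Saving the original command word before writing the new one lets the
recovery code restore it once the device has been cleared to go back on
the bus.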

Then further processing, such as signalling the involved device's driver
that the error happened and letting it attempt to recover, is something
that should happen in a less restricted context. Only the driver can
further determine the cause based on the state of the device's
registers (e.g. what the target was when the reporting device acted as a
master) and the knowledge of how it operates, reset the device, etc.
Once the situation has been rectified and the device determined to be
capable of continuing to operate (e.g. the self-test built into the
firmware, if available, was run and reported success) the device can be
reconfigured and put on the bus again.

Maciej