On 8/30/2017 11:16 AM, Borislav Petkov wrote:See below.
On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:That's right. Each driver provides an err_handler hook. broadcast function
Link reset is not the only recovery mechanism. In the case of nonfatalI'm assuming that's broadcast_error_message()'s job.
errors, it is assumed that the endpoint CSR is still reachable.
Error is propagated the PCIe endpoint driver. Endpoint driver does a
re-initialization, we are back in business.
calls these.
static struct pci_driver e1000_driver = {
..
.err_handler = &e1000_err_handler
};
struct pci_error_handlers {
...
pci_ers_result_t (*error_detected)(struct pci_dev *dev,
enum pci_channel_state error);
}
I see. We should probably try to do something only if GHES_SEV_CORRECTED orThat's not true. The GHES code is changing the severity here before postingYou're missing the point that we would walk into that if branch *only* for
to the AER driver in ghes_do_proc().
if (gdata->flags & CPER_SEC_RESET)
aer_severity = AER_FATAL;
if (sev == GHES_SEV_RECOVERABLE &&
sec_sev == GHES_SEV_RECOVERABLE
severities. So if you have an AER_FATAL error but ghes severities are
not GHES_SEV_RECOVERABLE, nothing happens.
GHES_SEV_RECOVERABLE.
If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
in doing additional work.
We do not need to worry about the GHES_SEV_PANIC case. Those get sent to __ghes_panic() in ghes_proc() without even making it to ghes_do_proc(). Those errors are just printed and then the kernel panics.Sounds good. Do you still want to do PCIe recovery in the case ofNo, AER ISR is not set up if firmware first is enabled.So then this is a major suckage. We do AER recovery on FF systems only
for GHES_SEV_RECOVERABLE severity.
The behavior should match non firmware-first case ideally.Yes, that makes sense.
1. Print all correctable errors.
2. Go to do_recovery for all uncorrectable errors including fatal and
non-fatal.
This is also what AER driver does in the absence of firmware first via
handle_error_source().
Which would mean that we'd call aer_recover_queue() regardless of GHES
severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
or CPER_SEC_RESET. I.e., we can communicate all that by setting the
correct AER severity before calling aer_recover_queue(). And then call
do_recovery() based on AER severity.
Hmmm?
GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?