[RFC PATCH 0/4] acpi: apei: Improve error handling with firmware-first

From: Alexandru Gagniuc
Date: Tue Apr 03 2018 - 13:08:55 EST


Hi,

I'm helping Dell work through the issues related to PCIe and NVMe
hotplug. Although hotplug generally works, there are corner cases such as
pin bounce, drives failing and surprise removal that are not 100% worked out.
Because of this, NVMe is not yet at feature parity with SCSI and SAS.

One of the interesting issues is that most server vendors like to use
firmware-first (FFS), for various reasons that I won't go into. The side
effect of that is that we oftentimes don't even get a stab at correcting the
problem.

This is especially troublesome for NVMe, which needs PCIe hotplug to work
correctly. When we do get a stab, it's after FFS can't handle a fatal error,
and we're told of it through ACPI tables. On x86, this happens through an
NMI, and as soon as we see a "fatal" error, we panic().

One problem with this FFS approach is that AER never even gets notified of
the issue. And even if a PCIe drive were to stop responding, nvme or higher
block drivers would notice something is wrong even without AER. Unless there
is a physical defect or silicon bug, AER can recover the link.

Another issue we're seeing with FFS is that BIOSes assume that an OS will crash
on a fatal error reported through ACPI. Sometimes they will leave hardware in
a "kind of" working state, or will fail to re-arm the errors. From what I've
observed, this happens on hardware with silicon bugs. For example, PCIe root
ports are unaffected, but certain PCIe switches may stop issuing hotplug
interrupts. It's just another headache with FFS.

While I don't expect server vendors to drop FFS in favor of native AER control,
I do think we can harden Linux against the idiosyncrasies of such systems. The
scope of these patches is to protect against poorly designed firmware, and
perform proper error handling when possible. It is not to make FFS a first
class citizen in error handling.
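
To illustrate the intended policy, here is a minimal userspace sketch. It is
not the code in these patches: the names (enum sev, struct err_record,
handle_record, should_panic) and the "recoverable" flag are hypothetical and
only model the idea. Every record first gets a handling attempt, the handler
returns the severity that remains, and the panic decision is made only on
that result rather than on what firmware reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most severe. */
enum sev { SEV_NONE, SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL };

/* Hypothetical error record, as reported by firmware. */
struct err_record {
	enum sev reported;	/* severity claimed by firmware */
	bool recoverable;	/* assumption: some handler can recover it */
};

/*
 * Try to handle one record and return the severity that remains
 * afterwards. A record that firmware marked fatal but that a handler
 * can recover is downgraded instead of being taken at face value.
 */
static enum sev handle_record(const struct err_record *rec)
{
	assert(rec != NULL);

	if (rec->reported == SEV_FATAL && rec->recoverable)
		return SEV_RECOVERABLE;

	return rec->reported;
}

/* Panic only if an error is still fatal after every handling attempt. */
static bool should_panic(const struct err_record *recs, size_t n)
{
	enum sev worst = SEV_NONE;

	for (size_t i = 0; i < n; i++) {
		enum sev s = handle_record(&recs[i]);

		if (s > worst)
			worst = s;
	}

	return worst == SEV_FATAL;
}
```

In this model, a batch containing only recoverable "fatal" records does not
lead to a panic, while a single unrecoverable fatal record still does.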

Alexandru Gagniuc (4):
  acpi: apei: Return severity of GHES messages after handling
  acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq
  acpi: apei: Do not panic() in NMI because of GHES messages
  acpi: apei: Warn when GHES marks correctable errors as "fatal"

 drivers/acpi/apei/ghes.c | 100 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 64 insertions(+), 36 deletions(-)

--
2.14.3