Re: [PATCH 1/3] PCI/AER: Option to leave System Error Interrupts as-is

From: Keith Busch
Date: Fri Nov 02 2018 - 12:19:49 EST


On Fri, Nov 02, 2018 at 10:53:00AM +0100, Borislav Petkov wrote:
> On Mon, Oct 29, 2018 at 04:06:51PM -0500, Bjorn Helgaas wrote:
> > If I squint hard enough this sort of makes sense, but it also makes me
> > confused about how the normal APEI firmware-first model works.
> >
> > In the NON-firmware-first case, firmware isn't involved in handling AER
> > errors. The Linux AER driver fields an interrupt from a Root Port,
> > reads AER log registers, etc.
> >
> > In the normal APEI firmware-first case, when the hardware reports an
> > AER event, I think firmware gets control first, and *it* reads the AER
> > log registers, packages them up, and generates an interrupt to the OS,
> > which reads the packaged error state from the firmware via the HEST.
> >
> > If I understand this special Intel VMD firmware-first case correctly,
> > firmware gets control first, reads the AER log registers, and
> > synthesizes what looks to the OS like a normal AER interrupt. The
>
> Why?
>
> Why the faking?
>
> If firmware needs to get control, why doesn't it then *retain* control
> and report the error through HEST, like others do?
>
> AFAIUC, fw wants to do something underneath. What's wrong with making it
> a normal firmware-first case?

VMD acts a bit like a host-bus adapter. The firmware knows about the
adapter, but not about anything on the bus that it attaches to.

This "hybrid" approach is basically saying that the firmware knows about
the HBA, and it wants a chance to be notified of errors on the bus it
attaches to, but the firmware can't do anything about such errors.

The bus in this case is PCIe, where we have capable error handling in the
kernel driver, so we ultimately want the AER driver handling the errors.
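
To make the three cases in this thread easier to compare, here is a rough
sketch in plain C. It is not kernel code and not taken from the patch: the
enum, the struct, and the function name are all made up for illustration,
and the table only encodes the flows as they are described above (native,
normal firmware-first via HEST, and the VMD "hybrid").

```c
#include <stdbool.h>

/* Hypothetical labels for the three models discussed in this thread. */
enum aer_model {
	AER_NATIVE,		/* firmware not involved */
	AER_FIRMWARE_FIRST,	/* firmware packages the error, OS reads it via HEST */
	AER_HYBRID_VMD,		/* firmware is notified, OS AER driver still handles */
};

/* Hypothetical summary of who sees an error, per the descriptions above. */
struct aer_flow {
	bool firmware_notified_first;		/* firmware gets control first */
	bool os_aer_driver_fields_interrupt;	/* OS AER driver sees an AER interrupt */
};

static struct aer_flow aer_flow_for(enum aer_model model)
{
	struct aer_flow flow = { false, false };

	switch (model) {
	case AER_NATIVE:
		/* Root Port interrupt goes straight to the OS AER driver. */
		flow.os_aer_driver_fields_interrupt = true;
		break;
	case AER_FIRMWARE_FIRST:
		/* Firmware reads the AER logs; the OS gets the packaged state. */
		flow.firmware_notified_first = true;
		break;
	case AER_HYBRID_VMD:
		/* Firmware is notified, then the OS AER driver does the handling. */
		flow.firmware_notified_first = true;
		flow.os_aer_driver_fields_interrupt = true;
		break;
	}
	return flow;
}
```

The hybrid row is the only one where both fields are set, which is the
whole point of the approach: firmware gets its notification, and the AER
driver still does the actual error handling.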