Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

From: Alex_Gagniuc
Date: Thu Aug 09 2018 - 12:46:40 EST


On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote:
>> When we don't own AER, we shouldn't touch the AER error bits. This
>> happens unconditionally on device probe(). Clearing AER bits
>> willy-nilly might cause firmware to miss errors. Instead
>> these bits should get cleared by FFS, or via ACPI _HPX method.
>>
>> This race is mostly of theoretical significance, as it is not easy to
>> reasonably demonstrate it in testing.
>>
>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@xxxxxxxxx>
>> ---
>> drivers/pci/pcie/aer.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..18037a2a8231 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>>  	if (!pci_is_pcie(dev))
>>  		return -ENODEV;
>>
>> +	if (pcie_aer_get_firmware_first(dev))
>> +		return -EIO;
>
> I like this patch.
>
> Do we need the same thing in the following places that also clear AER
> status bits or write AER control bits?

In theory, every exported function should guard against this. I think
the idea a long, long time ago was that the check happens during
initialization, so the other paths are never reached.
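
To make "every exported function guards" concrete, here is a rough
user-space sketch of one shared ownership check used by two entry
points. It is only a model, not kernel code: struct mock_dev and all
mock_* names are made up for illustration, and zeroing a field stands
in for the real config-space writes that clear the status registers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct mock_dev {
	bool is_pcie;          /* models pci_is_pcie() */
	bool firmware_first;   /* models pcie_aer_get_firmware_first() */
	int aer_cap;           /* AER capability offset, 0 if absent */
	uint32_t uncor_status; /* models the uncorrectable status register */
	uint32_t cor_status;   /* models the correctable status register */
};

/*
 * Shared guard: returns 0 if the OS may touch the AER registers,
 * or a negative errno if it must leave them alone.
 */
static int mock_aer_os_may_touch(const struct mock_dev *dev)
{
	if (!dev->is_pcie)
		return -ENODEV;
	if (dev->firmware_first)	/* firmware owns AER: hands off */
		return -EIO;
	if (!dev->aer_cap)
		return -EIO;
	return 0;
}

/* Models clearing both status registers, as done on probe(). */
static int mock_cleanup_aer_error_status_regs(struct mock_dev *dev)
{
	int ret = mock_aer_os_may_touch(dev);

	if (ret)
		return ret;

	dev->cor_status = 0;
	dev->uncor_status = 0;
	return 0;
}

/* Models the driver-called variant that clears uncorrectable status only. */
static int mock_cleanup_aer_uncorrect_error_status(struct mock_dev *dev)
{
	int ret = mock_aer_os_may_touch(dev);

	if (ret)
		return ret;

	dev->uncor_status = 0;
	return 0;
}
```

With firmware_first set, both helpers return -EIO and leave the status
fields untouched; whether the real functions should share a helper
like this or open-code the check is a separate question.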

> enable_ecrc_checking()
> disable_ecrc_checking()

I don't immediately see how these would affect FFS, but the bits are
part of the AER capability structure. According to the FFS model, those
would be owned by FW, and we'd have to avoid touching them.

> pci_cleanup_aer_uncorrect_error_status()

This probably should be guarded. It's only called from a few specific
drivers, so the impact is not as high as being called from the core.

> pci_aer_clear_fatal_status()

This is only called when doing fatal_recovery, right?
For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.

If the ACPI code pulls its thinking appendage out of the other end of
the digestive tract, then we could be hitting this in the future. For
correctness, guarding makes sense.

The PCIe standards contact I usually talk to about these PCIe subtleties
is currently on vacation. The number one issue was an FFS corner case
with the OS clearing bits on probe. The other functions you mention are
a corner case of a corner case. The big fish is
pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to
have that resolved.

I'll sync up with Austin when he gets back to see about the other
functions, though I suspect we'll end up fixing them as well.

Alex

>>  	pos = dev->aer_cap;
>>  	if (!pos)
>>  		return -EIO;
>> --
>> 2.14.3
>>
>