Re: [PATCH v3] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

From: Sathyanarayanan Kuppuswamy
Date: Wed May 11 2022 - 20:29:52 EST

On 5/11/22 4:40 PM, Bjorn Helgaas wrote:
On Mon, Apr 18, 2022 at 03:02:37PM +0000, Kuppuswamy Sathyanarayanan wrote:
Currently the aer_irq() handler returns IRQ_NONE when neither
PCI_ERR_ROOT_UNCOR_RCV nor PCI_ERR_ROOT_COR_RCV is set, assuming there
is nothing to handle. This assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error. If the same kind of error is triggered again while we are
processing the first one, before we clear the error status in the
"Root Error Status" register, the multi-bit error is left intact,
because aer_irq() only clears the events it saw. This causes the
interrupt to fire again, and aer_irq() is entered with just the
multi-bit error logged in the "Root Error Status" register.
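
The sequence above can be replayed in a small user-space model. This is
only a sketch: hw_report_correctable(), w1c() and replay_race() are
hypothetical helpers, not kernel functions, and the register is modelled
as a plain integer with write-1-to-clear semantics. The two bit values
are copied from the kernel's pci_regs.h so the example builds on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Status bits (values as in the kernel's pci_regs.h). */
#define PCI_ERR_ROOT_COR_RCV        0x00000001u /* bit 0: ERR_COR received */
#define PCI_ERR_ROOT_MULTI_COR_RCV  0x00000002u /* bit 1: multiple ERR_COR received */

/* Hypothetical hardware model: a correctable error message arrives.
 * If one is already logged, only the "multiple" bit is added. */
static inline uint32_t hw_report_correctable(uint32_t reg)
{
	if (reg & PCI_ERR_ROOT_COR_RCV)
		return reg | PCI_ERR_ROOT_MULTI_COR_RCV;
	return reg | PCI_ERR_ROOT_COR_RCV;
}

/* Write-1-to-clear: only the bits that were written back are cleared. */
static inline uint32_t w1c(uint32_t reg, uint32_t written)
{
	return reg & ~written;
}

/* Replays the race and returns what is left in the register. */
static inline uint32_t replay_race(void)
{
	uint32_t reg = 0;

	reg = hw_report_correctable(reg);	/* first error raises the IRQ */
	uint32_t seen = reg;			/* snapshot read by the handler */
	assert(seen == PCI_ERR_ROOT_COR_RCV);

	reg = hw_report_correctable(reg);	/* second error, status not yet cleared */
	reg = w1c(reg, seen);			/* handler clears only what it saw */

	return reg;
}
```

In this model replay_race() leaves only PCI_ERR_ROOT_MULTI_COR_RCV set,
which is the state the next aer_irq() invocation would observe.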

Repeated AER recovery tests have revealed that this condition does
happen and that it prevents any new interrupt from being triggered.
Allow the interrupt to be processed even if only the multi-correctable
(BIT 1) or multi-uncorrectable (BIT 3) bit is set.
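
As a rough illustration of the difference between the two checks, the
following user-space sketch compares them on a status value with only a
multi-error bit set. The composition of AER_ERR_STATUS_MASK shown here is
an assumption based on the description above (the two "received" bits plus
BIT 1 and BIT 3), and irq_claimed_before_fix(), irq_claimed_after_fix()
and demo_multi_only() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits (values as in the kernel's pci_regs.h). */
#define PCI_ERR_ROOT_COR_RCV          0x00000001u /* bit 0 */
#define PCI_ERR_ROOT_MULTI_COR_RCV    0x00000002u /* bit 1 */
#define PCI_ERR_ROOT_UNCOR_RCV        0x00000004u /* bit 2 */
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV  0x00000008u /* bit 3 */

/* Assumed definition of the mask after the fix. */
#define AER_ERR_STATUS_MASK (PCI_ERR_ROOT_UNCOR_RCV |       \
			     PCI_ERR_ROOT_COR_RCV |         \
			     PCI_ERR_ROOT_MULTI_COR_RCV |   \
			     PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/* Check before the fix: only the first-occurrence bits count. */
static inline bool irq_claimed_before_fix(uint32_t status)
{
	return (status & (PCI_ERR_ROOT_UNCOR_RCV | PCI_ERR_ROOT_COR_RCV)) != 0;
}

/* Check after the fix: the multi-error bits count as well. */
static inline bool irq_claimed_after_fix(uint32_t status)
{
	return (status & AER_ERR_STATUS_MASK) != 0;
}

/* A status with only a multi-error bit set is ignored by the old
 * check (the handler would return IRQ_NONE) and accepted by the new one. */
static inline void demo_multi_only(void)
{
	uint32_t status = PCI_ERR_ROOT_MULTI_COR_RCV;

	assert(!irq_claimed_before_fix(status));
	assert(irq_claimed_after_fix(status));
}
```

Both checks still agree when a first-occurrence bit is set or when the
status is zero, so the change only affects the multi-bit-only case.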

Also note that when only a multi-bit error is set, this is not the
first occurrence of the error, so PCI_ERR_ROOT_ERR_SRC may hold zero
or some junk value. We therefore cannot cleanly process this error
information using aer_isr_one_error(). All we are attempting with this
fix is to make sure error interrupt processing can continue in this
scenario.

This error can be reproduced by making the following changes to the
aer_irq() function and executing the given test commands.

static irqreturn_t aer_irq(int irq, void *context)
        struct aer_err_source e_src = {};

        pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS,
                              &e_src.status);
+       pci_dbg(pdev->port, "Root Error Status: %04x\n",
+               e_src.status);
        if (!(e_src.status & AER_ERR_STATUS_MASK))

Do you mean

if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))

here? AER_ERR_STATUS_MASK would be after this fix.

Yes, you are correct. Do you want me to update it and the Fixes tag
and send the next version?


                return IRQ_NONE;

+       mdelay(5000);

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer