[RFC] SERR# handling by Linux

From: poza
Date: Mon Jul 23 2018 - 04:50:00 EST


Hi Bjorn and Keith,

This discussion is to extend the idea of follwing patch.
[PATCH] PCI/AER: Enable SERR# forwarding in non ACPI flow

PCIe Spec
7.6.2.1.3 Command Register (Offset 04h)

SERR# Enable â See Section 7.6.2.1.14.
When Set, this bit enables reporting upstream of Non-fatal and Fatal errors detected by the Function to the Root Complex. Note that errors are reported if enabled either through this bit or through the PCI Express specific bits in the Device Control register (see Section 7.6.3.4).
In addition, for Functions with Type 1 Configuration Space headers, this bit controls transmission by the primary interface of ERR_NONFATAL and ERR_FATAL error Messages forwarded from the secondary interface. This bit does not affect the transmission of forwarded ERR_COR messages.
A Root Complex Integrated Endpoint that is not associated with a Root Complex Event Collector is permitted to hardwire this bit to 0b.
Default value of this bit is 0b.


7.6.2.3.13 Bridge Control Register
SERR# Enable â See Section 7.6.2.1.147.5.1.8.
This bit controls forwarding of ERR_COR, ERR_NONFATAL and ERR_FATAL from secondary to primary.

6.2.3.2.2
The transmission of these error Messages by class (correctable, non-fatal, fatal) is enabled using the Reporting Enable bits of the Device Control register (see Section 7.6.3.4) or the SERR# Enable bit in the PCI Command register (see Section 7.6.2.1.3).


AER driver touches device control (and choose not to touch PCI_COMMAND)
On the hand SERR# of Bridge Control Register is not set either.

The meaning of both the SERR# for type-1 configuration space seems to me the same.
both essentially says that ERR_NONFATAL and ERR_FATAL from secondary to primary.
except that bridge control setting, also forwards ERR_COR messages while Command Register settings affect only ERR_NONFATAL and ERR_FATAL.

there are 2 cases,
1)hotplug Enabled slot is inserted with type-1 configuration space (bridge) and
2) hot plug disabled, where on our platform we typically set #SERR by firmware

So yes it makes sense to set #SERR bit by AER driver if it fins bridge.
but we not only have do
[PATCH] PCI/AER: Enable SERR# forwarding in non ACPI flow

but also we have to cover hotplug case and hence
pci_aer_init() should call
pci_enable_pcie_error_reporting(dev);


something like below.

int pci_aer_init(struct pci_dev *dev)
{
int rc;

dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);

pci_enable_pcie_error_reporting(dev);
return pci_cleanup_aer_error_status_regs(dev);
}


int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
int ret;

if (pcie_aer_get_firmware_first(dev))
return -EIO;

if (!dev->aer_cap)
return -EIO;

if (!IS_ENABLED(CONFIG_ACPI) &&
dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
u16 control;

/*
* A Type-1 PCI bridge will not forward ERR_ messages coming
* from an endpoint if SERR# forwarding is not enabled.
*/
pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control);
control |= PCI_BRIDGE_CTL_SERR;
pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control);
}

return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
}
EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);

also we have to remove pci_enable_pcie_error_reporting() call from the drivers.
because aer_init will do it for all the devices.

although I am not very sure is it safe to detect enable error reporting by default for all the error devices ?
e.g. setting PCI_EXP_DEVCTL.

probably drivers might want to call pci_disable_pcie_error_reporting() who doesnt want to participate in error reporting.

Regards,
Oza.