Hi Lukas/Stuart,
Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH.
Regards
Joseph
-----Original Message-----
From: Bjorn Helgaas <helgaas@xxxxxxxxxx>> Sent: Wednesday, November 3, 2021 6:34 AM
To: Bao, Joseph <joseph.bao@xxxxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Stuart Hayes <stuart.w.hayes@xxxxxxxxx>; Lukas Wunner <lukas@xxxxxxxxx>
Subject: Re: HW power fault defect cause system hang on kernel 5.4.y
[+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"), Lukas, pciehp expert]
On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote:
Hi, dear kernel developer,
Recently we encounter system hang (dead spinlock) when move to kernel
linux-5.4.y.
Finally, we use bisect to locate the suspicious commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df.
4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp:
Fix MSI interrupt race") to v5.4.69 just over a year ago.
Our system has some HW defect, which will wrongly set
PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop
jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD
bit since ctrl is not updated), I know this is our HW defect, but this
commit makes kernel trapped in this isr function and leads to kernel
hang (then the user could not get useful information to show what's
wrong), which I think is not expected behavior, so I would like to
report to you for discussion.
I guess this happens because the first time we handle PFD,
pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA?
It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change.
It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang.
I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp:
Fix MSI interrupt race").
diff --git a/drivers/pci/hotplug/pciehp_hpc.c
b/drivers/pci/hotplug/pciehp_hpc.c
index 356786a3b7f4b..88b996764ff95 100644
---
a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tre
e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc29
9cde85b18d1f46ac21e1ba
+++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab
+++ 9cc07da044d5bc087065545b1000df
@@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
struct controller *ctrl = (struct controller *)dev_id;
struct pci_dev *pdev = ctrl_dev(ctrl);
struct device *parent = pdev->dev.parent;
- u16 status, events;
+ u16 status, events = 0;
/*
* Interrupts only occur in D3hot or shallower and only if enabled
@@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
}
}
+read_status:
pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
if (status == (u16) ~0) {
ctrl_info(ctrl, "%s: no response from device\n", __func__); @@
-566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
* Slot Status contains plain status bits as well as event
* notification bits; right now we only want the event bits.
*/
- events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
- PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
- PCI_EXP_SLTSTA_DLLSC);
+ status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
+ PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
+ PCI_EXP_SLTSTA_DLLSC;
/*
* If we've already reported a power fault, don't report it again
* until we've done something to handle it.
*/
if (ctrl->power_fault_detected)
- events &= ~PCI_EXP_SLTSTA_PFD;
+ status &= ~PCI_EXP_SLTSTA_PFD;
+ events |= status;
if (!events) {
if (parent)
pm_runtime_put(parent);
return IRQ_NONE;
}
- pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
+ if (status) {
+ pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
+
+ /*
+ * In MSI mode, all event bits must be zero before the port
+ * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4).
+ * So re-read the Slot Status register in case a bit was set
+ * between read and write.
+ */
+ if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode)
+ goto read_status;
+ }
+
ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
if (parent)
pm_runtime_put(parent);