RE: [RFC PATCH 2/2] PCI:hip08:Add driver to handle HiSilicon hip08 PCIe controller's errors

From: Shiju Jose
Date: Fri Jan 17 2020 - 04:41:00 EST


Hi Bjorn,

Thanks for reviewing the patch.
Please find reply inline.

>-----Original Message-----
>From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
>Sent: 15 January 2020 14:14
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-acpi@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
>kernel@xxxxxxxxxxxxxxx; rjw@xxxxxxxxxxxxx; lenb@xxxxxxxxxx; bp@xxxxxxxxx;
>james.morse@xxxxxxx; tony.luck@xxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx;
>zhangliguang@xxxxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; Linuxarm
><linuxarm@xxxxxxxxxx>; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>;
>yangyicong <yangyicong@xxxxxxxxxx>
>Subject: Re: [RFC PATCH 2/2] PCI:hip08:Add driver to handle HiSilicon hip08
>PCIe controller's errors
>
>Follow convention for subject line.

Ok. We will ix it.

>
>On Wed, Jan 15, 2020 at 11:01:40AM +0000, Shiju Jose wrote:
>> From: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>>
>> The hip08 error handle driver logs and reports PCIe controller's
>> recoverable errors.
>> Perform root port reset and restore link status for the recovery.
>>
>> Following are some of the PCIe controller's recoverable errors 1.
>> completion transmission timeout error.
>> 2. CRS retry counter over the threshold error.
>> 3. ECC 2 bit errors
>> 4. AXI bresponse/rresponse errors etc.
>>
>> RFC: The appropriate location for this driver may be discussed.
>>
>> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> ---
>> drivers/pci/controller/Kconfig | 8 +
>> drivers/pci/controller/Makefile | 1 +
>> drivers/pci/controller/pcie-hisi-hip08-error.c | 323
>> +++++++++++++++++++++++++
>
>Seems like this driver and its Kconfig should be near
>drivers/pci/controller/dwc/pcie-hisi.c.

pcie-hisi.c was for our old hip* devices.
Our hip08 PCIe controller doesn't use DWC ip.
Thus the driver to be in /drivers/pci/controller/ ?

>
>> 3 files changed, 332 insertions(+)
>> create mode 100644 drivers/pci/controller/pcie-hisi-hip08-error.c
>>
>> diff --git a/drivers/pci/controller/Kconfig
>> b/drivers/pci/controller/Kconfig index c77069c..0ee99b8 100644
>> --- a/drivers/pci/controller/Kconfig
>> +++ b/drivers/pci/controller/Kconfig
>> @@ -260,6 +260,14 @@ config PCI_HYPERV_INTERFACE
>> The Hyper-V PCI Interface is a helper driver allows other drivers to
>> have a common interface with the Hyper-V PCI frontend driver.
>>
>> +config PCIE_HISI_HIP08_ERR_HANDLER
>
>Config symbol is too long, maybe CONFIG_PCI_HISI_ERR or similar (to be
>parallel with existing CONFIG_PCI_HISI). Both should probably be
>"CONFIG_PCIE", not "CONFIG_PCI". I can't remember why CONFIG_PCI_HISI is
>that way.

Ok. We will change it to CONFIG_PCI_HISI_ERR.

>
>> + depends on ARM64 || COMPILE_TEST
>> + depends on (ACPI && PCI_QUIRKS)
>> + bool "HiSilicon hip08 Soc PCIe local error handling driver"
>> + help
>> + Say Y here if you want PCIe error handling support
>> + for the PCIe local(controller) errors on HiSilicon hip08 SoC
>> +
>> source "drivers/pci/controller/dwc/Kconfig"
>> source "drivers/pci/controller/cadence/Kconfig"
>> endmenu
>> diff --git a/drivers/pci/controller/Makefile
>> b/drivers/pci/controller/Makefile index 3d4f597..ac9852f 100644
>> --- a/drivers/pci/controller/Makefile
>> +++ b/drivers/pci/controller/Makefile
>> @@ -28,6 +28,7 @@ obj-$(CONFIG_PCIE_MEDIATEK) += pcie-mediatek.o
>> obj-$(CONFIG_PCIE_MOBIVEIL) += pcie-mobiveil.o
>> obj-$(CONFIG_PCIE_TANGO_SMP8759) += pcie-tango.o
>> obj-$(CONFIG_VMD) += vmd.o
>> +obj-$(CONFIG_PCIE_HISI_HIP08_ERR_HANDLER) += pcie-hisi-hip08-error.o
>> # pcie-hisi.o quirks are needed even without CONFIG_PCIE_DW
>> obj-y += dwc/
>>
>> diff --git a/drivers/pci/controller/pcie-hisi-hip08-error.c
>> b/drivers/pci/controller/pcie-hisi-hip08-error.c
>> new file mode 100644
>> index 0000000..6f5d002
>> --- /dev/null
>> +++ b/drivers/pci/controller/pcie-hisi-hip08-error.c
>> @@ -0,0 +1,323 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * PCIe driver for handling PCIe local errors occurred on
>> + * HiSilicon hip08 PCIe controller.
>
>"PCIe" occurs too many times in this sentence. Strictly speaking this is not a
>"PCIe driver"; it's a driver for an ACPI device that reports hip08-related errors.

Ok. Will remove it.

>Hopefully we don't need a separate driver for every hip* device, so maybe the
>"hip08" name is too specific.

Currently we are using this single driver for all hip08 device. so maybe this is better:
yes, a separate driver is unnecessary. We'll remove "hip08" name and
make it more generic for hip* device.

>
>> + * Copyright (c) 2018-2019 Hisilicon Limited.
>
>Capitalize "HiSilicon" consistently. Also "hip08"; previous practice in
>drivers/pci is "Hip05", "Hip06", so use that unless HiSilicon itself does it
>differently.

Ok. We will fix.

>
>> +#include <linux/acpi.h>
>> +#include <acpi/ghes.h>
>> +#include <linux/bitfield.h>
>> +#include <linux/bitops.h>
>> +#include <linux/delay.h>
>> +#include <linux/irq.h>
>> +#include <linux/irqdomain.h>
>> +#include <linux/list.h>
>> +#include <linux/pci.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/resource.h>
>> +#include <linux/kfifo.h>
>> +#include <linux/spinlock.h>
>> +
>> +#include "../pci.h"
>> +
>> +#define HISI_PCIE_ERR_RECOVER_RING_SIZE 16
>> +#define HISI_PCIE_ERR_INFO_SIZE 1024
>> +
>> +/* HISI PCIe Local error definitions */
>> +#define HISI_PCIE_ERR_MISC_REGS 33
>> +
>> +#define HISI_PCIE_SUB_MODULE_ID_AP 0
>> +#define HISI_PCIE_SUB_MODULE_ID_TL 1
>> +#define HISI_PCIE_SUB_MODULE_ID_MAC 2
>> +#define HISI_PCIE_SUB_MODULE_ID_DL 3
>> +#define HISI_PCIE_SUB_MODULE_ID_SDI 4
>> +
>> +#define HISI_PCIE_LOCAL_VALID_VERSION BIT(0)
>> +#define HISI_PCIE_LOCAL_VALID_SOC_ID BIT(1)
>> +#define HISI_PCIE_LOCAL_VALID_SOCKET_ID BIT(2)
>> +#define HISI_PCIE_LOCAL_VALID_NIMBUS_ID BIT(3)
>> +#define HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID BIT(4)
>> +#define HISI_PCIE_LOCAL_VALID_CORE_ID BIT(5)
>> +#define HISI_PCIE_LOCAL_VALID_PORT_ID BIT(6)
>> +#define HISI_PCIE_LOCAL_VALID_ERR_TYPE BIT(7)
>> +#define HISI_PCIE_LOCAL_VALID_ERR_SEVERITY BIT(8)
>> +#define HISI_PCIE_LOCAL_VALID_ERR_MISC 9
>> +
>> +#define HISI_ERR_SEV_RECOVERABLE 0
>> +#define HISI_ERR_SEV_FATAL 1
>> +#define HISI_ERR_SEV_CORRECTED 2
>> +#define HISI_ERR_SEV_NONE 3
>> +
>> +guid_t hip08_pcie_sec_type = GUID_INIT(0xB2889FC9, 0xE7D7, 0x4F9D,
>0xA8, 0x67,
>> + 0xAF, 0x42, 0xE9, 0x8B, 0xE7, 0x72);
>> +
>> +#define HISI_PCIE_CORE_ID(v) ((v) >> 3)
>> +#define HISI_PCIE_PORT_ID(core, v) ((v >> 1) + (core << 3))
>> +#define HISI_PCIE_CORE_PORT_ID(v) ((v % 8) << 1)
>> +#define HISI_PCIE_ROOT_BUSNR(v) ((v) ? 0x80 : 0)
>
>Is the root bus number really hard-wired in the chip? You're saying the only
>possible root bus numbers are 0x80 and 0x00? Typically this bus number is
>programmable.

We will fix it and remove the macro. We'll get the root bus number from acpi instead.

>
>Why parens around "v" (sometimes) but not others and "core"?

We will correct it.

>
>> +struct hisi_pcie_local_err_data {
>> + uint64_t val_bits;
>> + uint8_t version;
>> + uint8_t soc_id;
>> + uint8_t socket_id;
>> + uint8_t nimbus_id;
>> + uint8_t sub_module_id;
>> + uint8_t core_id;
>> + uint8_t port_id;
>> + uint8_t err_severity;
>> + uint16_t err_type;
>> + uint8_t reserv[2];
>> + uint32_t err_misc[HISI_PCIE_ERR_MISC_REGS];
>
>Use u64, u8, u32 throughout instead.

We will change it.

>
>> +};
>> +
>> +struct pcie_err_info {
>> + struct hisi_pcie_local_err_data err_data;
>> + struct platform_device *pdev;
>> +};
>> +
>> +static char *pcie_local_sub_module_name(uint8_t id) {
>> + switch (id) {
>> + case HISI_PCIE_SUB_MODULE_ID_AP: return "AP Layer";
>> + case HISI_PCIE_SUB_MODULE_ID_TL: return "TL Layer";
>> + case HISI_PCIE_SUB_MODULE_ID_MAC: return "MAC Layer";
>> + case HISI_PCIE_SUB_MODULE_ID_DL: return "DL Layer";
>> + case HISI_PCIE_SUB_MODULE_ID_SDI: return "SDI Layer";
>> + }
>> +
>> + return "unknown";
>> +}
>> +
>> +static char *err_severity(uint8_t err_sev) {
>> + switch (err_sev) {
>> + case HISI_ERR_SEV_RECOVERABLE: return "recoverable";
>> + case HISI_ERR_SEV_FATAL: return "fatal";
>> + case HISI_ERR_SEV_CORRECTED: return "corrected";
>> + case HISI_ERR_SEV_NONE: return "none";
>> + }
>> +
>> + return "unknown";
>> +}
>> +
>> +static struct pci_dev *hisi_hip08_pcie_get_rp(u32 chip_id, u32
>> +port_id) {
>> + u32 devfn = PCI_DEVFN(port_id, 0);
>> + u32 busnr = HISI_PCIE_ROOT_BUSNR(chip_id);
>> +
>> + return pci_get_domain_bus_and_slot(0, busnr, devfn); }
>> +
>> +static int hisi_hip08_pcie_port_acpi_reset(struct platform_device *pdev,
>> + u32 chip_id, u32 port_id)
>> +{
>> + struct device *dev = &(pdev->dev);
>
>Unnecessary parens. More occurrences below.

We will change it.

>
>> + struct acpi_object_list arg_list;
>> + union acpi_object arg[3];
>> +
>> + arg[0].type = ACPI_TYPE_INTEGER;
>> + arg[0].integer.value = chip_id;
>> + arg[1].type = ACPI_TYPE_INTEGER;
>> + arg[1].integer.value = HISI_PCIE_CORE_ID(port_id);
>> + arg[2].type = ACPI_TYPE_INTEGER;
>> + arg[2].integer.value = HISI_PCIE_CORE_PORT_ID(port_id);
>> +
>> + arg_list.count = 3;
>> + arg_list.pointer = arg;
>> + /* Call the ACPI handle to reset root port */
>
>s/root port /root port /
>
>> + if (ACPI_HANDLE(dev)) {
>
>Restructure this to return early for error and unindent the following, e.g.,

Ok. We will change it.

>
> acpi_handle handle = ACPI_HANDLE(dev);
>
> if (!handle) {
> dev_err(...);
> return -EINVAL;
> }
>
> arg[0].type = ACPI_TYPE_INTEGER;
> ...
> s = acpi_evaluate_integer(handle, ...);
>
>> + unsigned long long data = 0;
>> + acpi_status s;
>> +
>> + s = acpi_evaluate_integer(ACPI_HANDLE(dev),
>> + "RST", &arg_list, &data);
>> +
>> + if (ACPI_FAILURE(s)) {
>> + dev_err(dev, "No Reset method\n");
>> + return -EIO;
>> + }
>> +
>> + if (data) {
>> + dev_err(dev, "Failed to Reset\n");
>> + return -EIO;
>> + }
>> +
>> + } else {
>> + dev_err(dev, "No Reset method\n");
>> + return -EINVAL;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int hisi_hip08_pcie_port_reset(struct platform_device *dev,
>> + u32 chip_id, u32 port_id)
>> +{
>> + struct pci_dev *pdev;
>> + struct pci_bus *root_bus;
>> +
>> + pdev = hisi_hip08_pcie_get_rp(chip_id, port_id);
>> + if (!pdev) {
>> + dev_info(&(dev->dev), "Fail to get root port device\n");
>> + return -ENODEV;
>> + }
>> + root_bus = pdev->bus;
>> +
>> + pci_stop_and_remove_bus_device_locked(pdev);
>> +
>> + if (hisi_hip08_pcie_port_acpi_reset(dev, chip_id, port_id))
>> + return -EIO;
>> + ssleep(1UL);
>
>Please include a comment that cites the spec section that requires this sleep.

we'll add a comment. We use a 1s delay here as does in pci_reset_secondary_bus().
It's necessary for re-initialization of subordinate devices.

>
>> +
>> + /* add root port and downstream devices */
>> + pci_lock_rescan_remove();
>> + pci_rescan_bus(root_bus);
>> + pci_unlock_rescan_remove();
>> +
>> + return 0;
>> +}
>> +
>> +static void pcie_local_error_handle(const struct hisi_pcie_local_err_data
>*err,
>> + struct platform_device *pdev) {
>> + char buf[HISI_PCIE_ERR_INFO_SIZE];
>> + char *p = buf, *end = buf + sizeof(buf);
>> + struct device *dev = &(pdev->dev);
>> + uint32_t i;
>> + int rc;
>> +
>> + if (err->val_bits == 0) {
>> + dev_warn(dev, "%s: no valid error information\n", __func__);
>> + return;
>> + }
>> +
>> + /* Logging */
>> + p += snprintf(p, end - p, "[ Table version=%d ", err->version);
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOC_ID)
>> + p += snprintf(p, end - p, "SOC ID=%d ", err->soc_id);
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOCKET_ID)
>> + p += snprintf(p, end - p, "socket ID=%d ", err->socket_id);
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_NIMBUS_ID)
>> + p += snprintf(p, end - p, "nimbus ID=%d ", err->nimbus_id);
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID)
>> + p += snprintf(p, end - p, "sub module=%s ",
>> + pcie_local_sub_module_name(err-
>>sub_module_id));
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_CORE_ID)
>> + p += snprintf(p, end - p, "core ID=core%d ", err->core_id);
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_PORT_ID)
>> + p += snprintf(p, end - p, "port ID=port%d ", err->port_id);
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_SEVERITY)
>> + p += snprintf(p, end - p, "error severity=%s ",
>> + err_severity(err->err_severity));
>> +
>> + if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_TYPE)
>> + p += snprintf(p, end - p, "error type=0x%x ", err->err_type);
>> +
>> + p += snprintf(p, end - p, "]\n");
>> + dev_info(dev, "\nHISI HIP08: PCIe local error\n");
>> + dev_info(dev, "%s\n", buf);
>> +
>> + dev_info(dev, "Reg Dump:\n");
>> + for (i = 0; i < HISI_PCIE_ERR_MISC_REGS; i++) {
>> + if (err->val_bits & BIT(HISI_PCIE_LOCAL_VALID_ERR_MISC + i))
>> + dev_info(dev,
>> + "ERR_MISC_%d=0x%x\n", i, err->err_misc[i]);
>> + }
>> +
>> + /* Recovery for the PCIe local errors */
>> + if (err->err_severity == HISI_ERR_SEV_RECOVERABLE) {
>> + /* try reset PCI port for the error recovery */
>> + rc = hisi_hip08_pcie_port_reset(pdev, err->socket_id,
>> + HISI_PCIE_PORT_ID(err->core_id, err-
>>port_id));
>> + if (rc) {
>> + dev_info(dev, "fail to do hip08 pcie port reset\n");
>> + return;
>> + }
>> + }
>> +}
>> +
>> +static DEFINE_KFIFO(pcie_err_recover_ring, struct pcie_err_info,
>> + HISI_PCIE_ERR_RECOVER_RING_SIZE); static
>> +DEFINE_SPINLOCK(pcie_err_recover_ring_lock);
>> +
>> +static void pcie_local_err_recover_work_func(struct work_struct
>> +*work) {
>> + struct pcie_err_info pcie_err_entry;
>> +
>> + while (kfifo_get(&pcie_err_recover_ring, &pcie_err_entry)) {
>> + pcie_local_error_handle(&pcie_err_entry.err_data,
>> + pcie_err_entry.pdev);
>> + }
>> +}
>> +
>> +static DECLARE_WORK(pcie_err_recover_work,
>> +pcie_local_err_recover_work_func);
>> +
>> +
>> +static void hip08_pcie_local_error_handle(struct acpi_hest_generic_data
>*gdata,
>> + int sev, void *data)
>> +{
>> + const struct hisi_pcie_local_err_data *err_data =
>> + acpi_hest_get_payload(gdata);
>> + struct pcie_err_info err_info;
>> + struct platform_device *pdev = data;
>> + struct device *dev = &(pdev->dev);
>> +
>> + memcpy(&err_info.err_data, err_data, sizeof(*err_data));
>> + err_info.pdev = pdev;
>> +
>> + if (kfifo_in_spinlocked(&pcie_err_recover_ring, &err_info, 1,
>> + &pcie_err_recover_ring_lock))
>> + schedule_work(&pcie_err_recover_work);
>> + else
>> + dev_warn(dev, "Buffer overflow when recovering PCIe local
>> +error\n");
>
>I'd call this a "queue full" warning or similar. "Buffer overflow"
>suggests that we wrote past the end of a buffer and corrupted some memory,
>but that's not the case here.

We will change it to "queue full".

>
>> +}
>> +
>> +static int hisi_hip08_pcie_err_handler_probe(struct platform_device
>> +*pdev) {
>> + int ret = 0;
>
>Pointless local variable; maybe you meant to return failure if
>ghes_register_event_handler() fails? Don't initialize unless it's necessary.

We will fix it. We need to return error value on ghes_register_event_handler failure.

>
>> +
>> + if (ghes_register_event_handler(hip08_pcie_sec_type,
>> + hip08_pcie_local_error_handle,
>> + pdev)) {
>> + dev_err(&(pdev->dev), "%s : ghes_register_event_handler
>fail\n",
>> + __func__);
>> + return ret;
>> +}
>
>Indentation error.

Ok. we will correct it.

>
>> +
>> + return 0;
>> +}
>> +
>> +static int hisi_hip08_pcie_err_handler_remove(struct platform_device
>> +*pdev) {
>> + ghes_unregister_event_handler(hip08_pcie_sec_type);
>> +
>> + return 0;
>> +}
>> +
>> +static const struct acpi_device_id hip08_pcie_acpi_match[] = {
>> + { "HISI0361", 0 },
>> + { }
>> +};
>> +
>> +static struct platform_driver hisi_hip08_pcie_err_handler_driver = {
>> + .driver = {
>> + .name = "hisi-hip08-pcie-err-handler",
>> + .acpi_match_table = hip08_pcie_acpi_match,
>> + },
>> + .probe = hisi_hip08_pcie_err_handler_probe,
>> + .remove = hisi_hip08_pcie_err_handler_remove,
>> +};
>> +module_platform_driver(hisi_hip08_pcie_err_handler_driver);
>> +
>> +MODULE_DESCRIPTION("HiSilicon HIP08 PCIe controller error handling
>> +driver"); MODULE_LICENSE("GPL v2");
>> +
>> --
>> 1.9.1
>>
>>

Thanks,
Shiju