[PATCH V2] nvme-pci: Fix EEH failure on ppc
From: wenxiong
Date: Wed Feb 07 2018 - 15:16:42 EST
From: Wen Xiong <wenxiong@xxxxxxxxxxxxxxxxxx>
With commit b2a0eb1a0ac72869c910a79d935a0b049ec78ad9 ("nvme-pci: Remove
watchdog timer"), EEH recovery stops working on ppc.

Since the watchdog timer routine was removed, triggering EEH on ppc now
hits the EEH freeze inside nvme_timeout(). Check whether the PCI channel
is offline at the beginning of nvme_timeout(): if it is already offline,
PCI error recovery is in progress and no further nvme timeout handling is
needed. Returning BLK_EH_RESET_TIMER re-arms the request timer so the
command can be retried once EEH recovery has reset the controller.

With this patch, EEH recovery works successfully on ppc.
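For reference, pci_channel_offline() is a cheap inline test of the
device's error_state, so it is safe to call at the top of the timeout
handler. A sketch of the helper as found in 4.14-era include/linux/pci.h
(shown for illustration only, not part of this patch):

	/* Nonzero while EEH/AER recovery has the channel frozen or dead */
	static inline int pci_channel_offline(struct pci_dev *pdev)
	{
		return (pdev->error_state != pci_channel_io_normal);
	}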
Signed-off-by: Wen Xiong <wenxiong@xxxxxxxxxxxxxxxxxx>
[ 232.585495] EEH: PHB#3 failure detected, location: N/A
[ 232.585545] CPU: 8 PID: 4873 Comm: kworker/8:1H Not tainted 4.14.0-6.el7a.ppc64le #1
[ 232.585646] Workqueue: kblockd blk_mq_timeout_work
[ 232.585705] Call Trace:
[ 232.585743] [c000003f7a533940] [c000000000c3556c] dump_stack+0xb0/0xf4 (unreliable)
[ 232.585823] [c000003f7a533980] [c000000000043eb0] eeh_check_failure+0x290/0x630
[ 232.585924] [c000003f7a533a30] [c008000011063f30] nvme_timeout+0x1f0/0x410 [nvme]
[ 232.586038] [c000003f7a533b00] [c000000000637fc8] blk_mq_check_expired+0x118/0x1a0
[ 232.586134] [c000003f7a533b80] [c00000000063e65c] bt_for_each+0x11c/0x200
[ 232.586191] [c000003f7a533be0] [c00000000063f1f8] blk_mq_queue_tag_busy_iter+0x78/0x110
[ 232.586272] [c000003f7a533c30] [c0000000006367b8] blk_mq_timeout_work+0xa8/0x1c0
[ 232.586351] [c000003f7a533c80] [c00000000015d5ec] process_one_work+0x1bc/0x5f0
[ 232.586431] [c000003f7a533d20] [c00000000016060c] worker_thread+0xac/0x6b0
[ 232.586485] [c000003f7a533e30] [c00000000016a528] kthread+0x168/0x1b0
[ 232.586539] [c000003f7a533e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[ 232.586640] nvme nvme0: I/O 10 QID 0 timeout, reset controller
[ 232.586640] EEH: Detected error on PHB#3
[ 232.586642] EEH: This PCI device has failed 1 times in the last hour
[ 232.586642] EEH: Notify device drivers to shutdown
[ 232.586645] nvme nvme0: frozen state error detected, reset controller
[ 234.098667] EEH: Collect temporary log
[ 234.098694] PHB4 PHB#3 Diag-data (Version: 1)
[ 234.098728] brdgCtl: 00000002
[ 234.098748] RootSts: 00070020 00402000 c1010008 00100107 00000000
[ 234.098807] RootErrSts: 00000000 00000020 00000001
[ 234.098878] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 234.098937] PhbSts: 0000001800000000 0000001800000000
[ 234.098990] Lem: 0000000100000100 0000000000000000 0000000100000000
[ 234.099067] PhbErr: 000004a000000000 0000008000000000 2148000098000240 a008400000000000
[ 234.099140] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 234.099250] PcieDlp: 0000000000000000 0000000000000000 8000000000000000
[ 234.099326] RegbErr: 00d0000010000000 0000000010000000 8800005800000000 0000000007011000
[ 234.099418] EEH: Reset without hotplug activity
[ 237.317675] nvme 0003:01:00.0: Refused to change power state, currently in D3
[ 237.317740] nvme 0003:01:00.0: Using 64-bit DMA iommu bypass
[ 237.317797] nvme nvme0: Removing after probe failure status: -19
[ 361.139047689,3] PHB#0003[0:3]: Escalating freeze to fence PESTA[0]=a440002a01000000
[ 237.617706] EEH: Notify device drivers the completion of reset
[ 237.617754] nvme nvme0: restart after slot reset
[ 237.617834] EEH: Notify device driver to resume
[ 238.777746] nvme0n1: detected capacity change from 24576000000 to 0
[ 238.777841] nvme0n2: detected capacity change from 24576000000 to 0
[ 238.777944] nvme0n3: detected capacity change from 24576000000 to 0
[ 238.778019] nvme0n4: detected capacity change from 24576000000 to 0
[ 238.778132] nvme0n5: detected capacity change from 24576000000 to 0
[ 238.778222] nvme0n6: detected capacity change from 24576000000 to 0
[ 238.778314] nvme0n7: detected capacity change from 24576000000 to 0
[ 238.778416] nvme0n8: detected capacity change from 24576000000 to 0
---
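Note: the "frozen state error detected", "restart after slot reset" and
resume messages in the log above come from the PCI error recovery
callbacks that nvme-pci already registers; roughly, in the 4.14-era
driver:

	static const struct pci_error_handlers nvme_err_handler = {
		.error_detected	= nvme_error_detected,
		.slot_reset	= nvme_slot_reset,
		.resume		= nvme_error_resume,
	};

This sketch is for context only; the patch merely makes nvme_timeout()
defer to that recovery path instead of racing it with a controller reset.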
drivers/nvme/host/pci.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..4809f3d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1153,12 +1153,6 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
if (!(csts & NVME_CSTS_CFS) && !nssro)
return false;
- /* If PCI error recovery process is happening, we cannot reset or
- * the recovery mechanism will surely fail.
- */
- if (pci_channel_offline(to_pci_dev(dev->dev)))
- return false;
-
return true;
}
@@ -1189,6 +1183,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
struct nvme_command cmd;
u32 csts = readl(dev->bar + NVME_REG_CSTS);
+ /* If PCI error recovery process is happening, we cannot reset or
+ * the recovery mechanism will surely fail.
+ */
+ if (pci_channel_offline(to_pci_dev(dev->dev)))
+ return BLK_EH_RESET_TIMER;
+
/*
* Reset immediately if the controller is failed
*/
--
1.7.1