Periodic IO lockup is causing server to stop responding

From: Ian Coetzee
Date: Mon Mar 29 2021 - 03:00:01 EST


Hi All,

We have run into a slight mishap here on one of our servers, which I am hoping you could help narrow down to a cause.

One of our servers locks up every so often, seemingly because of a disk IO lockup. Symptoms include a high load average (106) stemming from the CPU spending around 97-99% of its time waiting on the disk (iowait). When this occurs, any new SSH session is met with a connection timeout.
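
For reference, the load and iowait figures above are what standard tools report during an event; something along these lines should reproduce them (iostat comes from the sysstat package):

  # overall load and CPU iowait (the "wa" column)
  $ uptime
  $ vmstat 1 5

  # per-device utilisation and average wait times
  $ iostat -x 1 5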

Kernel version: 5.10.24-uls #1 SMP Fri Mar 19 11:31:52 SAST 2021 x86_64 Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz GenuineIntel GNU/Linux

The following log entries appeared in dmesg around the time the last lockup started.

[Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), sas_address(0x500304800175f088), phy(8)
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical id(0x500304800175f0bf), slot(8)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, assuming scmd(0x0000000029a7ef73) might have completed
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS scmd(0x0000000029a7ef73)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task abort!scmd(0x00000000de97f273), outstanding for 184620 ms & timeout 180000 ms
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1993 CDB: Write(16) 8a 00 00 00 00 00 34 d5 2c 00 00 00 04 00 00 00
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), sas_address(0x500304800175f088), phy(8)
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical id(0x500304800175f0bf), slot(8)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, assuming scmd(0x00000000de97f273) might have completed
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS scmd(0x00000000de97f273)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task abort!scmd(0x00000000e4cfbc75), outstanding for 184600 ms & timeout 180000 ms
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1987 CDB: Write(16) 8a 00 00 00 00 00 34 d5 96 a8 00 00 01 58 00 00
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), sas_address(0x500304800175f088), phy(8)
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical id(0x500304800175f0bf), slot(8)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, assuming scmd(0x00000000e4cfbc75) might have completed
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS scmd(0x00000000e4cfbc75)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task abort!scmd(0x000000002282f27d), outstanding for 184620 ms & timeout 180000 ms
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1986 CDB: Write(16) 8a 00 00 00 00 00 34 d5 3c 00 00 00 04 00 00 00
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), sas_address(0x500304800175f088), phy(8)
[Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical id(0x500304800175f0bf), slot(8)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, assuming scmd(0x000000002282f27d) might have completed
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS scmd(0x000000002282f27d)
[Tue Mar 23 06:09:32 2021] sd 0:0:8:0: device_unblock and setting to running, handle(0x0012)
[Tue Mar 23 06:09:33 2021] sd 0:0:8:0: Power-on or device reset occurred
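
All of the aborted commands target sd 0:0:8:0 ([sdi], slot 8 of the enclosure), and the 180000 ms figure appears to correspond to the per-device SCSI command timer. If it is useful, this is the sort of state we can capture from the affected disk the next time it happens (smartctl is from smartmontools; sdi is the device named in the log above):

  # map SAS addresses / handles to block devices
  $ lsscsi -t

  # command timer (in seconds) and current state for sdi
  $ cat /sys/block/sdi/device/timeout
  $ cat /sys/block/sdi/device/state

  # drive health and error counters
  $ smartctl -x /dev/sdi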

We are running a bank of drives in software RAID, all on the following controller:

           *-storage
                description: Serial Attached SCSI controller
                product: SAS3008 PCI-Express Fusion-MPT SAS-3
                vendor: Broadcom / LSI
                physical id: 0
                bus info: pci@0000:01:00.0
                logical name: scsi0
                version: 02
                width: 64 bits
                clock: 33MHz
                capabilities: storage pm pciexpress vpd msi msix bus_master cap_list rom
                configuration: driver=mpt3sas latency=0
                resources: irq:24 ioport:e000(size=256) memory:fb200000-fb20ffff memory:fb100000-fb1fffff
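
If the array layout is relevant, this is how we would dump the software RAID state (assuming md; /dev/md0 below is a placeholder for the actual array):

  # overall md RAID status and resync progress
  $ cat /proc/mdstat

  # detail for one array
  $ mdadm --detail /dev/md0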

So far we have not seen this on any of our other servers running the same kernel version.

Please let me know if I can provide any more information; we have since restarted the server.

Kind regards
Ian Coetzee