ata1.00: failed command: WRITE FPDMA QUEUED on new AMD AM4 MSI B350 Motherboard

From: Mark Hounschell
Date: Fri Jul 07 2017 - 15:11:48 EST


With both the 4.11 and 4.12 kernels I get the following errors during heavy disk I/O, such as a kernel build with "make -j 15". Even copying the kernel source tree from one place to another triggers them. The hardware is an MSI B350 Tomahawk Arctic motherboard with 16GB of memory and a Ryzen 1700 processor. The disk in use is a 160GB Seagate ST3160815AS whose media is error free according to "badblocks -w".

Jul 6 13:34:43 cpu0 kernel: ata1.00: exception Emask 0x11 SAct 0x7ffbffff SErr 0x400000 action 0x6 frozen
Jul 6 13:34:43 cpu0 kernel: ata1.00: irq_stat 0x48000008, interface fatal error
Jul 6 13:34:43 cpu0 kernel: ata1: SError: { Handshk }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:00:57:89:90/00:00:03:00:00/40 tag 0 ncq dma 4096 out
res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul 6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:08:87:89:90/00:00:03:00:00/40 tag 1 ncq dma 4096 out
res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul 6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/20:10:97:89:90/00:00:03:00:00/40 tag 2 ncq dma 16384 out
res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
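
As a reading aid for the cmd lines above, here is a minimal Python sketch that unpacks the taskfile bytes. It assumes the field order libata uses in its error report (command/feature:nsect:lbal:lbam:lbah, then the same five "hob" high-order bytes, then device) and 512-byte sectors. The names decode_cmd, CMD_RE and NCQ_COMMANDS are made up for this example and are not kernel symbols.

```python
import re

# Assumed layout of the libata "cmd" line:
# CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/" + FIVE + "/" + FIVE + r"/([0-9a-f]{2})")

NCQ_COMMANDS = {0x60, 0x61}  # READ FPDMA QUEUED, WRITE FPDMA QUEUED


def decode_cmd(line):
    """Return command, LBA, sector count, byte count and NCQ tag from a cmd line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no cmd taskfile found in line")
    command = int(m.group(1), 16)
    feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hfeat, hnsect, hlbal, hlbam, hlbah = (int(b, 16) for b in m.group(3).split(":"))

    # 48-bit LBA: the hob bytes carry bits 24..47.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal

    if command in NCQ_COMMANDS:
        # NCQ puts the sector count in the feature fields and the tag in nsect bits 7:3.
        sectors = (hfeat << 8) | feat
        tag = nsect >> 3
    else:
        # Non-NCQ 48-bit commands keep the count in nsect; the "tag" printed
        # by the kernel is libata's internal tag and is not in the taskfile.
        sectors = (hnsect << 8) | nsect
        tag = None
    sectors = sectors or 65536  # a count of 0 means 65536 sectors

    return {"command": command, "lba": lba, "sectors": sectors,
            "bytes": sectors * 512, "tag": tag}
```

Applied to the first cmd line above, this gives 8 sectors (4096 bytes) at LBA 0x03908957 with tag 0, which agrees with the "dma 4096" the kernel printed.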

When I set the kernel cmdline option libata.force=noncq, the messages change to:

[ 1724.372101] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1724.375888] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1724.379721] ata1: SError: { Handshk }
[ 1724.383691] ata1.00: failed command: WRITE DMA EXT
[ 1724.383695] ata1.00: cmd 35/00:50:67:0d:e4/00:09:02:00:00/e0 tag 10 dma 1220608 out
res 51/84:50:67:0d:e4/00:09:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1724.383699] ata1.00: status: { DRDY ERR }
[ 1724.383700] ata1.00: error: { ICRC ABRT }
[ 1724.383706] ata1: hard resetting link
[ 1724.850060] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1724.959883] ata1.00: configured for UDMA/133
[ 1724.959910] ata1: EH complete
[ 1921.704356] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1921.708292] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1921.712210] ata1: SError: { Handshk }
[ 1921.716294] ata1.00: failed command: WRITE DMA EXT
[ 1921.716297] ata1.00: cmd 35/00:90:ef:93:86/00:03:02:00:00/e0 tag 18 dma 466944 out
res 51/84:90:ef:93:86/00:03:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1921.716298] ata1.00: status: { DRDY ERR }
[ 1921.716298] ata1.00: error: { ICRC ABRT }
[ 1921.716303] ata1: hard resetting link
[ 1922.175312] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1922.284165] ata1.00: configured for UDMA/133
[ 1922.288602] ata1: EH complete
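
For reference, the register values in the log above can be decoded with a few bit masks. The Python sketch below assumes the standard ATA status and error bit positions and the SATA SStatus and SError field layout; the function and table names are made up for this example.

```python
# ATA status register bits (bit 4 is not named in the kernel's report and is skipped here).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

# ATA error register bits that appear in the kernel's "error: { ... }" line.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# SError bit 22 is the handshake error reported as "Handshk".
SERR_HANDSHAKE = 1 << 22

# SStatus SPD field values for the three SATA generations.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_bits(value, names):
    """List the names of all bits from the table that are set in value."""
    return [name for bit, name in names.items() if value & bit]


def decode_sstatus(value):
    """Split SStatus into DET (bits 3:0), SPD (bits 7:4) and IPM (bits 11:8)."""
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


if __name__ == "__main__":
    print(decode_bits(0x51, STATUS_BITS))   # status byte from "res 51/84:..."
    print(decode_bits(0x84, ERROR_BITS))    # error byte from "res 51/84:..."
    print(bool(0x400000 & SERR_HANDSHAKE))  # SErr 0x400000
    link = decode_sstatus(0x113)            # "SStatus 113" is printed in hex
    print(link, SPEEDS.get(link["spd"], "unknown"))
```

For this log, status 0x51 decodes to DRDY and ERR, error 0x84 to ICRC and ABRT, and SStatus 0x113 to DET=3, SPD=1 (1.5 Gbps), IPM=1, all consistent with what the kernel printed.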


smartctl shows no issues with the drive. In fact, I can take this very drive
and install it in an AM3 machine and everything works just fine. I have
also installed a PCIe SATA card and connected the drive to that, and that
works just fine too.

So I have either a Linux kernel problem or a hardware problem on
this brand new AM4 motherboard. I don't really know what it
is, other than that it is something related to the AMD B350 chipset.

It is a fairly new chipset, so I am suspicious of the kernel.

# smartctl -a /dev/sda
smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.11.6-lcrs] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 6RACD737
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 (minor revision not indicated)
Local Time is: Fri Jul 7 13:50:50 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       416
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       68
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       100916113
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       48052
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       416
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   066   045    Old_age   Always       -       30 (Min/Max 26/30)
194 Temperature_Celsius     0x0022   030   034   000    Old_age   Always       -       30 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   065   000    Old_age   Always       -       168805116
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       46
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
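
In case anyone wants to compare the raw counters before and after a reproduction run (UDMA_CRC_Error_Count, for example), here is a minimal Python sketch that parses the attribute rows above. It assumes the ten whitespace-separated columns of the "smartctl -a" attribute table; parse_attribute and raw_counters are hypothetical helper names, not part of smartmontools.

```python
def parse_attribute(line):
    """Parse one smartctl attribute row; return None for headers and other lines."""
    # Split on whitespace into at most 10 fields so that a RAW_VALUE
    # containing spaces, e.g. "30 (Min/Max 26/30)", stays in one piece.
    fields = line.split(None, 9)
    if len(fields) != 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9].strip(),
    }


def raw_counters(text):
    """Map attribute name to raw value string for every attribute row in text."""
    rows = (parse_attribute(line) for line in text.splitlines())
    return {row["name"]: row["raw"] for row in rows if row is not None}
```

Saving the output of "smartctl -a /dev/sda" before and after a test run and diffing the two raw_counters() dictionaries would show which counters moved during the run.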

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Any pointers would be greatly appreciated.

Regards
Mark