Re: Reproduceable SATA lockup on 3.7.8 with SSD

From: Mathieu Desnoyers
Date: Mon Feb 25 2013 - 20:02:41 EST


* Marc MERLIN (marc@xxxxxxxxxxx) wrote:
> Howdy,
>
> I seem to have the same problem (or similar) as Mathieu Desnoyers in
> https://lkml.org/lkml/2013/2/22/437
>
> I can reliably get my SSD to drop from the SATA bus given the right workload
> on linux.
>
> How can I tell if it's Linux's fault or the drive's fault?

Here is a pseudo-git-blame checklist that might be useful for accurate
finger-pointing when a drive fails:

- try the diagnostic tools from your drive vendor; if they report your
drive as bad, then it might just be your drive failing,
- try to run a SMART self-test with smartmontools,
- try to reproduce your issue with a simple test-case (trying my test
program might help) that clearly fails quickly, and all the time, on
your problematic hardware,
- find out if there are known firmware upgrades for your drive provided
by your vendor, try them out,
- find out if there are known BIOS upgrades for your machine provided by
your vendor, try them out,
- try test-case on various kernel versions,
- try test-case on various distributions (just in case),
- try test-case with power management disabled in your machine's BIOS,
- try test-case with other SSD drives of the exact same model as
yours, so you can see if it's just your own drive failing,
- try moving your drive to a different machine (same model, different
model), and see if the test-case still fails,
- try with other SSD drives (from other vendors) on your machine,
- check whether your partition mount options enable TRIM, and try
disabling TRIM explicitly (see mount(8), discard/nodiscard option),
- try using a different filesystem (just in case),
- try using a different block I/O scheduler,
- try using your drive vendor's SSD eraser to reinitialize your entire
disk (yes, you will lose all your data). This might be useful if
TRIM handling has changed after a firmware upgrade, for instance.
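
For the TRIM and I/O scheduler items above, here is a minimal Python
sketch for recording the current state before changing anything. It is
only an illustration: it assumes the usual /proc/mounts layout (device,
mount point, filesystem type, options) and the sysfs file
/sys/block/<dev>/queue/scheduler, where the active scheduler is shown in
square brackets. The function names and the default device "sda" are
hypothetical.

```python
#!/usr/bin/env python3
"""Report mounts using the 'discard' option and the active I/O scheduler."""
import os
import re
import sys


def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs whose options include 'discard'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        device, mountpoint, _fstype, options = fields[:4]
        for opt in options.split(","):
            # 'nodiscard' must not match; 'discard=...' variants should.
            if opt == "discard" or opt.startswith("discard="):
                found.append((device, mountpoint))
                break
    return found


def active_scheduler(sysfs_text):
    """Return the bracketed scheduler name, or None if none is bracketed."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"  # example device name
    with open("/proc/mounts") as f:
        for device, mountpoint in mounts_with_discard(f.read()):
            print("discard enabled: %s on %s" % (device, mountpoint))
    sched_path = "/sys/block/%s/queue/scheduler" % dev
    if os.path.exists(sched_path):
        with open(sched_path) as f:
            print("%s scheduler: %s" % (dev, active_scheduler(f.read())))
```

Keeping this output next to each test-case run makes it easier to tell
later which combination of mount options and scheduler was in effect
when a failure occurred.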

With all those results in hand, it should become easier to identify the
cause of your problem. My personal research currently indicates that
every Intel SSDSC2BW180A3L drive found in the Lenovo x230 laptops I have
tested so far (4 different laptops) fails after a couple of minutes
with my simple random-access-write workload. Moving the drives into a
different laptop (x200) does not help (it still fails).
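
To give an idea of what a random-access-write workload looks like, here
is a minimal Python sketch. It is not the actual test program referred
to above; the function name, file size, block size, write count and
fsync interval are all assumptions chosen for illustration. It writes
one block at a time at random block-aligned offsets in a preallocated
file and calls fsync periodically so the writes actually reach the
drive.

```python
#!/usr/bin/env python3
"""Sketch of a random-access-write workload on a single file."""
import os
import random
import sys


def random_write_workload(path, file_size=64 * 1024 * 1024, block_size=4096,
                          n_writes=10000, sync_every=256, seed=None):
    """Write n_writes blocks at random aligned offsets; return bytes written."""
    rng = random.Random(seed)
    n_blocks = file_size // block_size
    payload = os.urandom(block_size)  # same random block reused for each write
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    written = 0
    try:
        os.ftruncate(fd, file_size)  # set the file size up front
        for i in range(1, n_writes + 1):
            offset = rng.randrange(n_blocks) * block_size
            written += os.pwrite(fd, payload, offset)
            if i % sync_every == 0:
                os.fsync(fd)  # push dirty pages out to the device
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: %s <scratch-file-on-the-suspect-drive>" % sys.argv[0])
    total = random_write_workload(sys.argv[1])
    print("wrote %d bytes" % total)
```

Point it at a scratch file on the suspect drive, never at data you care
about, and watch dmesg while it runs. Larger file sizes and write
counts may be needed before a marginal drive shows the problem.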

Good luck!

Mathieu

>
> Thanks,
> Marc
>
> ----- Forwarded message from Marc MERLIN <marc@xxxxxxxxxxx> -----
>
> From: Marc MERLIN <marc@xxxxxxxxxxx>
> To: linux-ide@xxxxxxxxxxxxxxx
>
> Hopefully this is the right list. I know that IDE!=SATA, but I can't find
> a SATA list.
> Please redirect me if needed.
>
> Hardware:
> Lenovo T530, 64bit kernel and userland.
> Hardware is shown below, but 2 drives: one SSD (OCZ-VERTEX4) and one HD (Hitachi HTS54101).
>
> The SSD will lock up reliably if I run a specific mencoder command that reads MP4
> files and rewrites them to another file in the same directory.
>
> The log of what happens is shown below; the drive is eventually taken off the bus.
> Once I reboot, it is back, as if nothing happened.
> If I do the same command on the HD, it works, but of course timings will be different
> since the HD is slower.
>
> How can I tell if it's the SSD's firmware's fault, or the linux SATA/AHCI code
> that is buggy?
>
> Thanks,
> Marc
>
> Failure log:
> ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:00:00:38:13/04:00:33:00:00/40 tag 0 ncq 524288 out
> res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:08:00:3c:13/04:00:33:00:00/40 tag 1 ncq 524288 out
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> (snipped)
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:e8:00:30:13/04:00:33:00:00/40 tag 29 ncq 524288 out
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:f0:00:34:13/04:00:33:00:00/40 tag 30 ncq 524288 out
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: limiting SATA link speed to 3.0 Gbps
> ata1: hard resetting link
> ata1: COMRESET failed (errno=-16)
> ata1: reset failed, giving up
> ata1.00: disabled
> ata1.00: device reported invalid CHS sector 0
> (...)
> ata1.00: device reported invalid CHS sector 0
> ata1: EH complete
> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda]
> Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> sd 0:0:0:0: [sda] CDB:
> Write(10): 2a 00 33 13 34 00 00 04 00 00
> end_request: I/O error, dev sda, sector 856896512
> sd 0:0:0:0: [sda] Unhandled error code
>
>
> Boot shows:
> ahci 0000:00:1f.2: version 3.0
> ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
> ahci: SSS flag set, parallel bus scan disabled
> ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x13 impl SATA mode
> ahci 0000:00:1f.2: flags: 64bit ncq ilck stag pm led clo pio slum part ems sxs apst
> ahci 0000:00:1f.2: setting latency timer to 64
> scsi0 : ahci
> scsi1 : ahci
> scsi2 : ahci
> scsi3 : ahci
> scsi4 : ahci
> scsi5 : ahci
> ata1: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538100 irq 42
> ata2: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538180 irq 42
> ata3: DUMMY
> ata4: DUMMY
> ata5: SATA max UDMA/133 abar m2048@0xf2538000 port 0xf2538300 irq 42
> ata6: DUMMY
> scsi6 : pata_legacy
> ata7: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14
> ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> ata1.00: ATA-9: OCZ-VERTEX4, 1.5, max UDMA/133
> ata1.00: 1000215216 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
> ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> ata1.00: configured for UDMA/133
> scsi 0:0:0:0: Direct-Access ATA OCZ-VERTEX4 1.5 PQ: 0 ANSI: 5
> sd 0:0:0:0: [sda] 1000215216 512-byte logical blocks: (512 GB/476 GiB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> sda: sda1 sda2 sda3 sda4
> sd 0:0:0:0: [sda] Attached SCSI disk
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> ata2.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> ata2.00: ATA-8: Hitachi HTS541010A9E680, JA0OA480, max UDMA/133
> ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
> ata2.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> ata2.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> ata2.00: configured for UDMA/133
> scsi 1:0:0:0: Direct-Access ATA Hitachi HTS54101 JA0O PQ: 0 ANSI: 5
> sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> sd 1:0:0:0: [sdb] 4096-byte physical blocks
> sd 1:0:0:0: [sdb] Write Protect is off
> sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> ACPI: Invalid Power Resource to register!
> ACPI: Invalid Power Resource to register!<6>[ 1.433751] tsc: Refined TSC clocksource calibration: 2893.427 MHz
> Switching to clocksource tsc
> sdb: sdb1 sdb2 sdb3 sdb4
> sd 1:0:0:0: [sdb] Attached SCSI disk
> ata5: SATA link down (SStatus 0 SControl 300)
> scsi7 : pata_legacy
> ata8: PATA max PIO4 cmd 0x170 ctl 0x376 irq 15
>
> ----- End forwarded message -----
>
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
> .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com