Re: Reproduceable SATA lockup on 3.7.8 with SSD

From: Marc MERLIN
Date: Tue Feb 26 2013 - 11:50:40 EST


On Tue, Feb 26, 2013 at 10:29:59AM -0500, Jeff Garzik wrote:
> On 02/25/2013 07:27 PM, Marc MERLIN wrote:
> >Howdy,
> >
> >I seem to have the same problem (or similar) as Mathieu Desnoyers in
> >https://lkml.org/lkml/2013/2/22/437
> >
> >I can reliably get my SSD to drop from the SATA bus given the right
> >workload on linux.
> >
> >How can I tell if it's linux's fault or the drive's fault?
>
> Manually force speed to 3.0 Gbps, then 1.5 Gbps, and see what happens.
>
> Try module/kernel parameter libata.force=1.5Gbps or libata.force=3.0Gbps

Ok, so by reading my log at time of failure, you saw that speed was
flipping between the two? (I couldn't see that, but I'm not good at reading
it).
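For anyone else trying to answer the same question from their own log, here
is a rough sketch of how the negotiated speed per port could be pulled out of
kernel log text. It assumes the usual libata "SATA link up N Gbps" message
format; the sample log below is made up for illustration and is not taken
from my failure log, and the function names are my own.

```python
import re
from collections import defaultdict

# Matches libata link-up messages such as:
#   ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
LINK_UP = re.compile(r"\b(ata\d+): SATA link up ([\d.]+) Gbps")

# Made-up sample for illustration only; feed real dmesg output instead.
SAMPLE_LOG = """\
[    1.234567] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.300000] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  812.000001] ata1: limiting SATA link speed to 3.0 Gbps
[  817.000002] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
"""


def link_speed_history(log_text):
    """Return {port: [negotiated speeds in Gbps, in order of appearance]}."""
    history = defaultdict(list)
    for line in log_text.splitlines():
        match = LINK_UP.search(line)
        if match:
            port, speed = match.groups()
            history[port].append(float(speed))
    return dict(history)


def ports_with_speed_changes(log_text):
    """Return the sorted list of ports whose link speed was not constant."""
    return sorted(
        port
        for port, speeds in link_speed_history(log_text).items()
        if len(set(speeds)) > 1
    )
```

A port showing up in ports_with_speed_changes() would only mean the link was
renegotiated at a different speed at some point; it does not say by itself
whether the drive, the cable or the controller is at fault.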

Also, just to make sure: you're not asking me to change the speed at
runtime, but rather to
1) boot once with the speed forced to 3Gbps and try to reproduce
2) boot a 2nd time with the speed forced to 1.5Gbps and try to reproduce

If libata is not a module in my kernel, I can still put
libata.force=1.5Gbps
on the lilo/grub command line, correct?
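Assuming the option is accepted on the boot line, one way to confirm after
boot that it actually reached the kernel would be to look at /proc/cmdline.
Below is a minimal sketch of that check. It assumes the documented
libata.force format of comma-separated [ID:]VAL entries; the function name
is hypothetical and the command lines in the comments are examples, not my
real ones.

```python
def libata_force_settings(cmdline):
    """Return the libata.force entries found on a kernel command line.

    Each entry is an (id, value) tuple. id is None when the value has no
    port prefix and therefore applies to all ports, for example:
        "libata.force=1.5Gbps"           -> [(None, "1.5Gbps")]
        "libata.force=1:3.0Gbps,2:noncq" -> [("1", "3.0Gbps"), ("2", "noncq")]
    """
    prefix = "libata.force="
    settings = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if not item:
                continue
            # Split on the last ':' so an ID such as "1.00" stays intact.
            ident, sep, value = item.rpartition(":")
            settings.append((ident if sep else None, value))
    return settings
```

On a running system this could be fed with open("/proc/cmdline").read(). An
empty result would suggest the bootloader entry was not the one that got
booted.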

Thanks,
Marc

On Mon, Feb 25, 2013 at 08:02:32PM -0500, Mathieu Desnoyers wrote:
> - try diagnostic tools from your drive vendor, if it reports your drive
> as bad, then it might just be your drive failing,

Good point, drive is brand new (just replaced).

> - try to run a SMART test from smartmontools,

Unfortunately, OCZ does not support SMART.

> - try to reproduce your issue with a simple test-case (trying my test
> program might help) that clearly fails quickly, and all the time, on
> your problematic hardware,

My test fails 100% on my hardware too. Very easy to reproduce.
I think it's basically a large volume of reads/writes that causes it.

> - find out if there are known firmware upgrades for your drive provided
> by your vendor, try them out,

Did that, I have the latest.

> - find out if there are known BIOS upgrades for your machine provided by
> your vendor, try them out,
> - try test-case on various kernel versions,
> - try test-case on various distributions (just in case),
> - try test-case with power management disabled in your machine's BIOS,
> - try test-case with other SSD drives of the exact same model as
> yours, so you can see if it's just your own drive failing,
> - try moving your drive to a different machine (same model, different
> model), and see if the test-case still fails,
> - try with other SSD drives (from other vendors) on your machine,
> - check if your partition mount options enable TRIM or not, try to
> disable TRIM explicitly (see mount(8), discard/nodiscard option),
> - try using a different filesystem (just in case),
> - try using a different block I/O scheduler,
> - try using your drive vendor's SSD eraser, to reinitialize your entire
> disk (yes, you will lose all your data). This might be useful if
> TRIM handling has changed after a firmware upgrade for instance.

Those will take a while :) especially without spare hardware.

I'll try older kernels first when I have a chance though.
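The TRIM item in the list above is one of the cheaper ones to check. A small
sketch of how it could be done, assuming text in the /proc/mounts layout
(device, mountpoint, fstype, options, dump, pass) and that an enabled
discard shows up as a "discard" entry in the options field. The function
name and the sample lines in the comments are made up for illustration.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs mounted with the 'discard' option.

    Expects one mount per line in /proc/mounts format, for example:
        /dev/sda1 / ext4 rw,noatime,discard,errors=remount-ro 0 0
        /dev/sda2 /home ext4 rw,relatime 0 0
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Exact match on the option name, so "nodiscard" is not counted.
        if "discard" in options.split(","):
            found.append((device, mountpoint))
    return found
```

On a live system the input would be open("/proc/mounts").read(). An empty
list would only mean that online discard is not enabled through mount
options; it says nothing about TRIM being issued some other way.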

Thanks for your reply.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/