Re: SATA problems

From: Pablo Sebastian Greco
Date: Thu Jan 04 2007 - 22:19:34 EST


Pablo Sebastian Greco wrote:
Tejun Heo wrote:
Pablo Sebastian Greco wrote:
By crash I mean the whole system going down, having to reset the entire
machine.
I'm sending you 4 files:
dmesg: current boot dmesg, just a boot, because no errors appeared after
last crash, since the server is out of production right now (errors
usually appear under heavy load, and this primarily a transparent proxy
for about 1000 simultaneous users)
lspci: the way you asked for it
messages and messages.1: files where you can see old boots and crashes
(even a soft lockup).
If there is anything else I can do, let me know. If you need direct
access to the server, I can arrange that too.

Can you try 2.6.20-rc3 and see if 'CLO not available' message goes away
(please post boot dmesg)?

The crash/lock is because filesystem code does not cope with IO errors
very well. I can't tell why timeouts are occurring in the first place.
It seems that only samsung drives are affected (sda2, 3, 4). Hmmm...
Please apply the attached patch to 2.6.20-rc3 and test it.

Thanks.

Here's boot dmesg with 2.6.20-rc3 + blacklist. And you are right about only affecting samsung drives, but since only those drives get all the heavy load, couldn't tell exactly.
I'm putting the server in production right now, so I think in a few hours I'll have more info.

Thanks.
Pablo.
After an uptime of 13:34 under heavy load and no errors, I'm pretty sure your patch is correct. Is there a way to backport this to 2.6.18.x?
Just an off topic question, does anyone know why I get so uneven IRQ handling on 2.6.19-20 and almost perfect on 2.6.20-rc2-mm1?

Thanks for everything.
Pablo.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/