Re: 2.6.29 regression: ATA bus errors on resume

From: Niel Lambrechts
Date: Tue May 26 2009 - 01:44:37 EST


On 05/26/2009 06:58 AM, Tejun Heo wrote:
Hello, Niel.

Niel Lambrechts wrote:
I've tested all of the kernels I have again since 2.6.29.4 also came out
just recently. I did a hibernate/resume for each in the console, then
repeated the same in X, then continued to the next kernel.

The 2.6.29.4 log is much larger, since some other badness happened there
- there is a large kernel trace in there as my first X hibernation
attempt failed and came back to X after a few seconds. The system seemed
functional, it did not keep generating kernel messages - when I then
retried a hibernate it worked, along with the resume. Another unrelated
bug perhaps?

As for "hard resetting link" messages, they seemed to always happen
under X the times I tried it.

Kernel      EXT4-errors?  Console: ata1 reset?  Console: ata2 reset?  X: ata1 reset?  X: ata2 reset?
2.6.28.10   No            no                    yes                   yes             no
2.6.29.4*   No            no                    no                    no              no
2.6.29.4**  No            -                     -                     yes             no
2.6.30-rc6  Yes           -                     -                     yes             no
2.6.30-rc6  No            no                    no                    yes             no

* Xorg hibernation attempt failed.
** Xorg second hibernation attempt (no extra reboot)
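
For anyone wanting to repeat this comparison, here is a rough sketch of how the reset columns could be tallied from a saved kernel log. It assumes the log contains lines of the form "ataN: hard resetting link", optionally preceded by a timestamp; the function name `count_hard_resets` is made up for illustration and is not part of any existing tool.

```python
import re
from collections import Counter

# Assumed message format, e.g. "[  123.456789] ata1: hard resetting link".
# The leading timestamp is optional, so stripped logs match too.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def count_hard_resets(lines):
    """Return a Counter mapping port name (e.g. 'ata1') to hard-reset count."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) gives a per-port count; a port with no matching lines simply reads as 0, which corresponds to a "no" in the table.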

I also did a side by side comparison of the messages I have for
2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
another one that worked just fine tonight. I stripped all time-stamps
and some pulseaudio messages from the bad one and attached them here,
and also saved the full messages for each kernel to
http://bugzilla.kernel.org/show_bug.cgi?id=13017 .

Since analysing the code-path is still a bit beyond me, I'll leave you
with a little summary of the differences I notice.

A = 2.6.30-rc6 (EXT4 clean)
B = 2.6.30-rc6 (EXT4 errors triggered)

Duplicate PHY events are likely to be dependent on timing and
non-deterministic. The ext4 corrupting or not depends on whether a
request with failfast set was in-flight at the time of the second PHY
event, which again is dependent on timing. At any rate, this looks
like a problem of ext4 (or something between ext4 and the driver). It
either shouldn't issue failfast command or should take appropriate
recovery action if it does. It would be really nice if you can give a
shot at ext3.

Urgh. My root file-system is mounted with extents on, so I would have to re-install entirely.

I'm wondering why no one else is complaining, or whether the problem is limited to ICH9M/M-E controllers with EXT4 or a certain type of hard-drive. The laptop is a Lenovo W500 (fairly similar to the T500), so maybe not a lot of people with this type of controller are using EXT4 yet.

Anyhow, I think Theodore may have ruled this out as an EXT4 problem already (I first copied him), so I'm not sure what to do now. It will take some strong will (and even more time) for me to re-install with EXT3. I just shouldn't have to, dammit. :-p

Regards,
Niel