Re: 2.6.22-rc3 hibernate(?) fails totally - regression (xfs on raid6)

From: Tejun Heo
Date: Thu Jun 14 2007 - 10:31:20 EST


Hello,

Rafael J. Wysocki wrote:
> On Thursday, 14 June 2007 00:12, Linus Torvalds wrote:
>> On Wed, 13 Jun 2007, David Greaves wrote:
>>>> I'm not seeing anything really obvious. The traces would probably look
>>>> better if you enabled CONFIG_FRAME_POINTER, though. That should cut down on
>>>> some of the noise and make the traces a bit more readable.
>>> I can do that...
>> Thanks. That makes a big difference to the readability of the traces.
>>
>> That said, I'm so used to reading even the messy ones that this didn't
>> actually tell me anything new (it made it clear that the SCSI error
>> handler noise was just noise), but for people who aren't quite as used to
>> seeing crap backtraces, your new trace might hopefully put them on the
>> right track.
>>
>> I threw out the parts that didn't look all that relevant, and left the
>> ata_aux/md0_raid5/hibernate traces here for others to look at without all
>> the other noise. Those _seem_ to be the primary suspects in this saga.
>
> Hmm, it looks like both hibernate and ata_aux are waiting for the same
> completion. I wonder who's supposed to complete it.

They're waiting for the commands they issued to complete. ata_aux is
trying to revalidate the scsi device after libata EH finished waking up
the port and hibernate is trying to resume scsi disk device. ata_aux is
issuing either TEST UNIT READY or START STOP. hibernate is issuing
START STOP.

This can be caused by one of the followings.

1. SCSI EH thread (ATA EH runs off it) for the SCSI device hasn't
finished yet. All commands are deferred while EH is in progress.

2. request_queue is stuck - somehow somebody forgot to kick the queue at
some point.

3. command is stuck somewhere in SCSI/ATA land.

#1 doesn't seem to be the case as all scsi_eh threads seems idle. I'm
looking at the code but can't find anything which could cause #2 or #3.
Also, these code paths are traveled really frequently.

I'm also trying to reproduce the problem here with xfs over RAID-6 array
but haven't been successful yet.

David, do you store the hibernation image on the RAID-6 array? Can you
post the captured kernel log when it locks up?

--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/