Re: SMP 2.1.90-pre3 SCSI kernel panic

sistema@readysoft.es
Mon, 16 Mar 1998 13:07:27 +0100 (MET)


On 16 Mar, Doug Ledford wrote:

>> 2.0.33 UP kernel works flawlessly. 2.0.33 SMP locks hard randomly, even
>> with a BusLogic Flashpoint card instead of the Adaptec one.

How can we explain that I have no problems with a 2.0.33 UP kernel +
aic-5.0.7 patch? The machine does exactly the same work, but with no
errors.

Even during an fsck, the machine locked up with >=2.1.89, but the fsck
completed correctly with 2.0.33 UP. I even filled the filesystem completely
and the system stayed up and running, with every disk sector full of data.

>> Since 2.1.89, including pre90-[123], SMP kernels keep hanging a little
>> later after getting these messages:
>
> Unless someone else knows of a change in 2.1.x that could cause this, I'm
> inclined to attribute this to a change in the way 2.1.x is trying to
> allocate the space on the filesystem. Aka, 2.1.x is trying to write to disk
> blocks that 2.0.x is ignoring. Here's the decode of your sense data:
>
>> Mar 16 10:39:57 rs120 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
>
> The aic7xxx driver got a check condition host status, performed a request
> sense operation, and is alerting the mid level code to the presence of the
> sense data (it was most likely accompanied by an underrun error or else we
> wouldn't have flagged the error code and instead would have let the mid
> level code look at the sense data and decide if there was even an error).
>
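
The return code discussed above can be split into bytes. This is only a
minimal sketch, assuming the conventional Linux layout of the SCSI result
word (driver byte, host byte, message byte, status byte, from most to least
significant). The function name split_result is made up for illustration.

```python
def split_result(result):
    """Split a 32-bit SCSI result word into its four bytes.

    Assumed layout (most to least significant byte):
    driver, host, message, status.
    """
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


# The value reported in the log line: "return code = 28000002" (hex).
fields = split_result(0x28000002)
for name, value in fields.items():
    print(f"{name:>6} byte = 0x{value:02x}")
```

Under that assumption the status byte is 0x02, which is the SCSI-2 status
code for CHECK CONDITION, and the host and message bytes are zero. The
driver byte 0x28 has the 0x08 bit set, which, if I read the scsi.h of that
era correctly, is the flag for "sense data available".
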
>> Mar 16 10:39:57 rs120 kernel: Deferred error sd08:03: sns = f1 4
>> Mar 16 10:39:57 rs120 kernel: ASC= 3 ASCQ= 0
>> Mar 16 10:39:57 rs120 kernel: Raw sense data:0xf1 0x00 0x04 0x00 0x7e 0x55 0x3e 0x0a 0x00 0x00 0x00 0x00 0x03 0x00 0x11 0x80
>
> Broken down, this is a deferred error with valid error information (0xf1 ==
> 0x71 (deferred error) | 0x80 (valid bit)).
>
> Sense key of 0x04, quoting from the SCSI-II spec:
>
> 4h HARDWARE ERROR. Indicates that the target detected a non-
> recoverable hardware failure (for example, controller failure,
> device failure, parity error, etc.) while performing the command
> or during a self test.
>
> ASC=0x03, ASCQ=0x00 is found in the table to be:
> 03h 00h DTL W SO PERIPHERAL DEVICE WRITE FAULT
>
> Sounds like a few bad sectors to me.
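
The decode above can be reproduced mechanically. The following is a rough
sketch, assuming the 16 raw bytes are SCSI-2 fixed-format sense data (valid
bit and response code in byte 0, sense key in the low nibble of byte 2,
information field in bytes 3-6, ASC and ASCQ in bytes 12 and 13). The
function name decode_sense is hypothetical.

```python
# Raw sense bytes exactly as printed in the kernel log.
RAW = bytes([0xf1, 0x00, 0x04, 0x00, 0x7e, 0x55, 0x3e, 0x0a,
             0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x11, 0x80])


def decode_sense(buf):
    """Decode the main fields of fixed-format sense data (assumed layout)."""
    response = buf[0] & 0x7F               # 0x70 = current, 0x71 = deferred
    return {
        "valid": bool(buf[0] & 0x80),      # information field is meaningful
        "deferred": response == 0x71,
        "sense_key": buf[2] & 0x0F,        # 0x4 = HARDWARE ERROR
        "information": int.from_bytes(buf[3:7], "big"),
        "asc": buf[12],                    # additional sense code
        "ascq": buf[13],                   # additional sense code qualifier
    }


sense = decode_sense(RAW)
print(sense)
print(f"information field = 0x{sense['information']:08x}")
```

This gives a deferred error with the valid bit set, sense key 0x4, ASC 0x03
and ASCQ 0x00, matching the decode above. For a disk, the information field
(0x007e553e here) is normally the logical block address the error refers to,
though that interpretation is my assumption and not something stated in the
log.
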

08:03 is the swap partition.
I get those errors in all the partitions. For instance:

Mar 14 13:16:52 rs120 kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 28000002
Mar 14 13:16:52 rs120 kernel: Additional sense indicates Peripheral device write fault
Mar 14 13:16:52 rs120 kernel: scsidisk I/O error: dev 08:03, sector 23392, absolute sector 8296867
Mar 14 13:16:54 rs120 kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 28000002
Mar 14 13:16:54 rs120 kernel: Deferred error sd08:01: sense key Hardware Error
Mar 14 13:16:54 rs120 kernel: Additional sense indicates Peripheral device write fault
Mar 14 13:16:54 rs120 kernel: scsidisk I/O error: dev 08:01, sector 2602386, absolute sector 2602449
Mar 14 13:17:14 rs120 kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 28000002
Mar 14 13:17:14 rs120 kernel: Deferred error sd08:01: sense key Hardware Error
Mar 14 13:17:14 rs120 kernel: Additional sense indicates Peripheral device write fault
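
To relate the "scsidisk I/O error" lines to each other, here is a small
sketch. It assumes that the logged "absolute sector" is the partition's
start sector plus the partition-relative "sector", and that a device such as
08:03 is a hexadecimal major:minor pair in which each SCSI disk (major 8)
owns 16 minors. The helper names sd_name and partition_start are invented
for this example.

```python
# (device, partition-relative sector, absolute sector) from the log above.
ERRORS = [
    ("08:03", 23392, 8296867),
    ("08:01", 2602386, 2602449),
]


def sd_name(dev):
    """Map a 'major:minor' string (hex) to a conventional sdXN name."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    if major != 8:
        raise ValueError("this sketch only handles SCSI disks (major 8)")
    disk = minor >> 4          # 16 minors per disk
    partition = minor & 0x0F   # 0 would be the whole disk
    return "sd" + chr(ord("a") + disk) + (str(partition) if partition else "")


def partition_start(relative, absolute):
    """Start sector of the partition, assuming absolute = start + relative."""
    return absolute - relative


for dev, relative, absolute in ERRORS:
    start = partition_start(relative, absolute)
    print(f"{dev} ({sd_name(dev)}): partition starts at sector {start}")
```

With the figures from the log this yields a start of 63 for 08:01 and
8273475 for 08:03. A first partition starting at sector 63 is a common
layout, which suggests the assumption about "absolute sector" is reasonable.
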

>> Last message is a kernel panic. I even get messages complaining about
>> insufficient disk space, but there's free space.
>
> The insufficient disk space is probably the result of the ext2fs not being
> able to properly read/write some inode block. The kernel panic would have
> to be posted before I could comment on it.
>
>> Any hints?
>> I can turn on scsi debugging and try to catch that bug with some help.
>
> Best solution to this problem is to get the scsiinfo package, use it to make
> sure the AWRE and ARRE bits are turned on in the read/write error recovery
> mode page on the SCSI drive, then back everything on the drive up, low level
> format the drive, and re-install. If the AWRE and ARRE bits weren't on
> before, then they should help in the future as the drive should
> automatically remap bad sectors out on the fly with those bits set. An
> alarmingly large number of SCSI drives these days ship with these bits turned
> off.
>

With 2.1.88 + aic-5.0.7 the system sometimes stopped working, but I got
no error messages. I could even see what was happening with the magic
keys; it was not a hard lock.
Now it's still not a hard lock, as I can use the magic keys, but I can't even
shut the system down properly.

As you (and Michael Weller, in a message I've just received) suggest, I'll
enable the AWRE and ARRE bits, as that doesn't require formatting.
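
For reference, a minimal sketch of what checking and setting those two bits
amounts to. It assumes the SCSI-2 layout of the read-write error recovery
mode page (page 01h), where AWRE is bit 7 and ARRE is bit 6 of byte 2. It
only manipulates a byte value; actually reading and writing the mode page on
a drive is left to a tool such as scsiinfo. The function names and the
sample value are illustrative only, not taken from any drive.

```python
AWRE = 0x80  # automatic write reallocation enabled (byte 2, bit 7)
ARRE = 0x40  # automatic read reallocation enabled (byte 2, bit 6)


def recovery_flags(byte2):
    """Report whether AWRE and ARRE are set in byte 2 of mode page 01h."""
    return {"AWRE": bool(byte2 & AWRE), "ARRE": bool(byte2 & ARRE)}


def enable_auto_realloc(byte2):
    """Return byte 2 with AWRE and ARRE set, other bits left untouched."""
    return byte2 | AWRE | ARRE


# Illustrative value only: a byte 2 with neither reallocation bit set.
before = 0x04
after = enable_auto_realloc(before)
print(f"before: 0x{before:02x} {recovery_flags(before)}")
print(f"after:  0x{after:02x} {recovery_flags(after)}")
```
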

But I don't understand why I get no problems with 2.0.33 UP +
aic-5.0.7, even with a full filesystem, while I cannot keep the system up
with 2.1.89.

It looks like something is broken elsewhere.
Thanks for your help,
Pau
