RE: PATCH: Further aacraid work

From: Byron Stanoszek
Date: Tue Jun 29 2004 - 15:22:14 EST


On Tue, 29 Jun 2004, Salyzyn, Mark wrote:

> Although I am gratified that 1.1.5-2345 works, and works well (*many*
> performance improvements have occurred in this driver, but some of those
> have reached Alan's patch), I am at a loss because I was sure that the
> lockup was associated with commands taking longer than 60 seconds to
> complete (a 'can happen' with a complicated RAID card with multiple
> targets, that unfortunately gives the SCSI layer and especially ext3
> some headaches). The workaround in the reset handler is to let the
> commands quiesce to give the Firmware some breathing space to respond to
> the test unit ready command that is issued to the target immediately
> following return from the reset handler.
>
> This not solving the problem has the implication, as Alan had noted,
> that this could be an issue with the adapter itself locking up or going
> unresponsive under load. Both reports, this one and "PERC 3/Di broken
> 2.6.6-mm5 -> 2.6.7-rc1-mm" now smell of some subtle problems creeping
> in.

Turns out I spoke too soon. I repeated the rsync and logged this information
from the serial console:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter appears dead -3
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 8079007
Buffer I/O error on device sda1, logical block 1009868
lost page write due to I/O error on sda1
scsi1 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 1009869
lost page write due to I/O error on sda1
...etc
Buffer I/O error on device sda1, logical block 1009877
lost page write due to I/O error on sda1
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 8079159
scsi1 (0:0): rejecting I/O to offline device
...etc
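
As a sanity check on the log above, the first failing sector on sda and the
first failing logical block on sda1 appear to describe the same spot on disk.
A minimal Python sketch of that arithmetic follows. It assumes 512-byte
sectors, a 4 KiB filesystem block size, and sda1 starting at sector 63; none of
these values is stated in the log, they are guesses that happen to make the
numbers line up. The function names are hypothetical.

```python
SECTOR_SIZE = 512      # bytes per sector (assumed)
BLOCK_SIZE = 4096      # filesystem block size in bytes (assumed, not in the log)
PARTITION_START = 63   # first sector of sda1 (assumed classic DOS layout)

SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE


def block_to_sector(block, start=PARTITION_START):
    """Map a logical block on the partition to its first sector on the disk."""
    return start + block * SECTORS_PER_BLOCK


def sector_to_block(sector, start=PARTITION_START):
    """Map a whole-disk sector back to the logical block that contains it."""
    return (sector - start) // SECTORS_PER_BLOCK


# First "Buffer I/O error" block versus the first end_request sector.
print(block_to_sector(1009868))   # 8079007

# Logical block that would contain the second end_request sector.
print(sector_to_block(8079159))   # 1009887
```

Under those assumptions the first end_request sector (8079007) is exactly the
start of logical block 1009868, so the two messages report the same failed
write rather than two separate faults.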


FYI, I have another [production] server sitting next to this one running
aacraid and so far it hasn't had any problems. It's a PowerEdge 2450, though
I'm unsure which PERC RAID device it has. So now I'm comparing two systems:

Production: PowerEdge 2450, aacraid driver (1.1.2-lk1), PERCRAID Mirror, Unknown HW
AAC0: kernel 2.1.4 build 2951
AAC0: monitor 2.1.4 build 2951
AAC0: bios 2.1.0 build 2951
AAC0: serial 73840ad0fafaf001

and

Development: PowerEdge 2400, aacraid driver (1.1-5[2345]), PERC RAID5, Perc 2/Si
AAC0: kernel 2.1-3 build 2939
AAC0: monitor 2.1-3 build 2939
AAC0: bios 2.1-3 build 2939
AAC0: serial 58801d0

One thing I noticed is that the firmware build on my Development server (the
one that's crashing) is older. Where can I find a firmware upgrade? I've been
scouring Dell's site for the last 20 minutes and all I can find are Linux
drivers.
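
The build comparison above can be made explicit with a short Python sketch
that parses the AAC0 lines from both machines and lists the components whose
build number is lower on the Development box. The regular expression and the
parse_builds() helper are hypothetical, written only against the line format
pasted in this message and not taken from the driver.

```python
import re

# Matches lines such as "AAC0: kernel 2.1.4 build 2951".
# The "serial" line carries no build number and is skipped on purpose.
LINE_RE = re.compile(r"^AAC\d+: (kernel|monitor|bios) (\S+) build (\d+)$")


def parse_builds(text):
    """Return {component: (version, build)} for each recognised line."""
    builds = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            component, version, build = match.groups()
            builds[component] = (version, int(build))
    return builds


production = """\
AAC0: kernel 2.1.4 build 2951
AAC0: monitor 2.1.4 build 2951
AAC0: bios 2.1.0 build 2951
AAC0: serial 73840ad0fafaf001
"""

development = """\
AAC0: kernel 2.1-3 build 2939
AAC0: monitor 2.1-3 build 2939
AAC0: bios 2.1-3 build 2939
AAC0: serial 58801d0
"""

prod = parse_builds(production)
dev = parse_builds(development)

# Components whose build number is lower on the Development machine.
older = sorted(c for c in dev if c in prod and dev[c][1] < prod[c][1])
print(older)   # ['bios', 'kernel', 'monitor']
```

All three components on the Development machine are at build 2939 against
2951 on Production, which is the difference noted above.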

> Can we get some additional tests on this?
>
> 1) Does this problem occur on a single CPU (Hyperthreading disabled
> too)? I am assuming some locking issues appeared, mainly because of some
> restructuring in that code.

I'm going to run the above test right now and let you know how it turns out.

Just for some history, I've been running 2.4 kernels on this machine for 2
years and haven't had a problem with the aacraid driver. The default Fedora
Core 2 install I had on it yesterday also seems to work flawlessly, and it's a
2.6.5-smp system.

-Byron

--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: byron@xxxxxxxxxxx