Re: Adaptec SCSI Driver fails during mirroring failover testing (2.2.15/2.3.99-pre6)

From: Doug Ledford (dledford@redhat.com)
Date: Thu Apr 27 2000 - 16:33:52 EST


"Jeff V. Merkey" wrote:

> If the SCSI device fails, it should send a test unit ready (0x0),
> followed by an inquiry command(0x12) to reprobe just the one device --

That depends on the type of failure. In general, the description you give here
is so overly simplified as to not mean anything.
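
For reference, the two commands named in the quoted text are both 6-byte
CDBs. The sketch below only builds the byte strings. It does not talk to any
hardware; the helper names are made up for illustration, and the 36-byte
allocation length is an assumed default for a standard INQUIRY.

```python
# Illustrative only: builds the CDB bytes, sends nothing to a device.
TEST_UNIT_READY = 0x00  # opcode quoted above
INQUIRY = 0x12          # opcode quoted above


def build_test_unit_ready_cdb() -> bytes:
    """6-byte TEST UNIT READY CDB: the opcode followed by five zero bytes."""
    return bytes([TEST_UNIT_READY, 0, 0, 0, 0, 0])


def build_inquiry_cdb(alloc_len: int = 36) -> bytes:
    """6-byte standard INQUIRY CDB (EVPD clear, page code 0).

    alloc_len is the size of the response buffer the initiator offers;
    36 is an assumed default, not something stated in this thread.
    """
    if not 0 <= alloc_len <= 0xFF:
        raise ValueError("this sketch keeps the allocation length in one byte")
    return bytes([INQUIRY, 0, 0, 0, alloc_len, 0])
```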

> not disabling the SCSI bus for every active hard disk the driver is
> controlling.

The driver (and certainly not the mid level code either) doesn't do this. The
only thing that will typically result in this type of behavior is when the act
of removing the device caused an electrical/signal problem on the bus that is
now putting the bus into an infinite wedge scenario where no commands to other
devices can get through. I've also seen this in certain types of drive
failures when the drive gets so confused that it essentially never releases the
BUSY pin on the bus, even after repeated bus resets.

> The Scripts should be robust enough to reset the bus
> without a SCSI manager getting involved or this being propagated to some
> upper layer module. The only exception to this would be for non-script
> SCSI chipsets (which this one is not).

If you implement your own timers and want to risk confusing the hell out of
the mid layer SCSI code, then go right ahead and do this. Otherwise, you have
to wait for the mid layer SCSI code to tell you that a command has timed out
and then take appropriate action. This is by design (although one that many
of us bitch about on occasion), not an omission from the aic7xxx driver.

> (I wrote SCSI scripts on NCR chipsets in 1991-1993 for Memorex Telex for
> Comm and Disk (PC and Mainframe and S370 Channel) and am very familiar
> with how this stuff is supposed to work -- killing a bus because someone
> pulls a swappable device out is a poor implementation -- it shouldn't
> work this way).

No, it shouldn't. But, without the actual error messages or a repeatable case
of this (since I don't have that problem here), there's not much I can do
about it. Since you are using async I/O to do this mirror operation (or at
least I thought you said you were), what's the retry limit on those commands?
Are they retrying forever, or, when the aic7xxx driver returns an error to the
upper layer, is it getting flagged as such and the operation dropped?

> I take it this means that if a single SCSI device ever fails, the SCSI
> module in Linux will potentially disable active devices, and mirroring
> failover on SCSI may not work correctly on some Linux SCSI drivers. FYI
> - The IDE driver works just fine if you unplug the cable from an active
> hard disk (I get IO errors, but can recover the system). If I take what
> you are telling me here at face value, then mirrored failover on SCSI
> may not work without some type of change being made to the SCSI layer.
> FYI -- NetWare and Windows 2000 both handle this just fine on the
> identical hardware.

Actually, I've seen lots of people have great success in the scenario you are
describing with things getting slightly upset until the currently outstanding
commands for the now absent drive are all exhausted, then the system picks up
and goes on like before but without the now missing device.

> Then how do we enable the system not to do this. The SCSI module should
> not be disabling active hard disks just because someone pulls a hard
> disk out of the chassis on one of the SCSI buses.

Nothing in the code that I know of does this. There are only two things I
know of that could cause this. One, the act of removing the drive confused
things on the bus enough that the bus wedged permanently. Two, the upper code
layers are retrying the commands infinitely instead of letting them die which
is resulting in the bus being slammed with infinite requests for a device
that is no longer there.
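
To make the second case concrete, here is a minimal sketch of what "letting
them die" could look like in an upper layer: retry a bounded number of times,
then give up and report the error. This is not code from the mirroring layer
or from the kernel; DeviceGone, submit_with_retry_limit and the default limit
of 3 are all hypothetical names and values chosen for illustration.

```python
class DeviceGone(Exception):
    """Hypothetical error standing in for a command failed by the driver."""


def submit_with_retry_limit(issue, max_retries=3):
    """Call issue() until it succeeds or max_retries retries are used up.

    issue is any callable that performs one attempt and raises DeviceGone
    on failure. After the first attempt plus max_retries retries the last
    error is re-raised, so the caller can drop the operation instead of
    resubmitting it forever.
    """
    for attempt in range(max_retries + 1):
        try:
            return issue()
        except DeviceGone:
            if attempt == max_retries:
                raise


# Usage: a command to an absent device fails every time.
attempts = []


def always_fails():
    attempts.append(1)
    raise DeviceGone("no response from target")


try:
    submit_with_retry_limit(always_fails, max_retries=3)
except DeviceGone:
    dropped = True  # the operation is dropped; no further requests are issued
```

With a bound like this the requests for the missing device stop after a
fixed number of attempts rather than continuing indefinitely.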

NOTE: once you pull that device out, with the exception of any tagged commands
that were active at the time, all future commands from the aic7xxx driver will
get returned after the SELECTION_TIMEOUT has occurred. Those commands that
were outstanding to the device will get returned after the first bus reset.
Once they have been returned, the mid layer will requeue them, and this time
they too should get a SELECTION_TIMEOUT.
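
The sequence in the note above can be written down as a toy model. It is only
a restatement of the note in code, not anything taken from the aic7xxx driver
or the mid layer; the Result enum and history_after_removal function are
hypothetical, and the model says nothing about how long each step takes.

```python
from enum import Enum


class Result(Enum):
    RETURNED_AFTER_BUS_RESET = "returned after first bus reset"
    SELECTION_TIMEOUT = "selection timeout"


def history_after_removal(outstanding, new):
    """Map each command to the results it sees once the device is pulled.

    outstanding: commands that were active on the device when it was removed
    new:         commands issued to the device after it was removed
    """
    history = {}
    # Commands issued after removal: nothing answers selection.
    for cmd in new:
        history[cmd] = [Result.SELECTION_TIMEOUT]
    # Commands that were outstanding come back after the first bus reset;
    # the mid layer then requeues them and they hit the selection timeout too.
    for cmd in outstanding:
        history[cmd] = [Result.RETURNED_AFTER_BUS_RESET,
                        Result.SELECTION_TIMEOUT]
    return history


# Usage: one write was in flight at removal, one read was issued afterwards.
example = history_after_removal(outstanding=["write-1"], new=["read-2"])
```

In this model every command ends in a selection timeout, which is the point
at which an upper layer with a retry limit can stop resubmitting it.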

> BTW. Thanks for responding. :-)
>
> How would you propose I proceed given what you just told me? Is there a
> configuration mode I can give Linux to get around this, or is this just
> unique to the particular hardware configuration I may be running.
> Please advise.

I need a duplicatable test case. I also need to know the nature of the SCSI
bus at the time this all happened. I need to know if maybe the drive was
mostly removed from its contacts but maybe had just enough of its edge
connector still in contact that it was actually screwing the bus while just
sitting there. Anything along these lines can help to track it down. Also,
the original post talks about 2.2.15 and 2.3.99-pre; I need to know which this
happened under (or if both), and you should probably update 2.2.15 with the
latest aic7xxx driver which is on my web site.

-- 

Doug Ledford <dledford@redhat.com>  http://people.redhat.com/dledford
Please check my web site for aic7xxx updates/answers before
e-mailing me about problems



