Re: ncr53c8xx SCSI driver locks up computer after timeout message

Gerard Roudier (groudier@club-internet.fr)
Wed, 24 Feb 1999 22:12:13 +0100 (MET)


If I had such an SCSI problem, I would probably track it this way:

1) - Use a setup with all features enabled: 80MB/s, disconnections, tagged
commmads, etc ...
- Only connect the most trustable SCSI device to the SCSI BUS.
Undubitably it is the SEAGATE hard drive. Connect it using the Ultra2
cable with the LVD/SE terminator at its end.
- Remove all other SCSI cables.
- Boot the system.
- Stress the system using heavy IO load.

If the system is not stable, then the device configuration that has the
problem is now a lot simpler than the initial one.

Let us assume that the system is stable. :)
I now would add to the SCSI BUS a second device and then stress again
the system using heavy IOs on the both devices, and so on ...

With such a strategy, you will be able to determine some simplest
combination of devices/cables that fails with the SCSI set-up you intend
to use with your system and the results may give some clue.

BTW, never leave part of the BUS (some end of a cable connected to the
BUS) without a terminator at its end.

The simpler the system that fails the easier it is to find the cause
of the problem.

There is 5 configurations to investigate and with some good guessing 2 or
3 should be enough :), and you can for example continue testing this way:

Example (hypothesis)
- The SEAGATE is ok when it is alone on the BUS.
- The system fails with the SEAGATE + MICROPolis

At this step, I will try different driver set-ups.

Let me imagine that the system always fails using those set-ups. :)

Then, the next thing I will try is to connect only the MICROPolis on the
(a) SCSI BUS and stress the SCSI BUS with heavy IO load.
Having 2 SCSI BUS (controllers) obviously help.

Etc...

Regards,
Gérard.

On Wed, 24 Feb 1999, Bradley M. Kuhn wrote:

> Thus spoke Gerard Roudier:
>
> > IIRC, the board uses a LVD to SE converter that allows to use SE device
> > and LVD device driven by a single SCSI initiator.
>
> > I never tried such a controller, but on controller that does not use LVD
> > to SE converter you must use *at most* 2 cables and terminate the BUS at
> > each end and only at each end. "End" mean the *extreme* end of the cable.
>
> I am doing that. However, I am currently relying on the external JAZ
> drive's terminator for one end of the chain...
>
> Based on your experience, do you think this terminator is not reliable?
>
> > If you dismount the Jaz and the Jaz was not at one end of the BUS, you
> > have been fine, otherwise you just broke your BUS compliance.
>
> So you are saying that the external JAZ drive's own terminator switch is
> faulty?

If it is passive, then it is not conformant to SPI-2 (scsi-3).

> I am going to buy an actual terminator tommorrow, and turn the Jaz drive's
> own terminator off. Do you really think this is my problem, though?

No need to think here. Remove the JAZ from the SCSI BUS and whether your
system fails or not under heavy IO load, you can just make sure.

> > > > 2 - If you donnot have either old 810 rev. < 16 or 815 or 825 rev < 16 on
> > > > your system, give a try with the latest sym53c8xx-1.2a driver that has
> > > > been recently enhanced for SCSI error recovery. Changes haven't yet been
> > > > backported to stock ncr53c8xx driver.
>
> I have a newer 895. I will try your new driver tommorrow evening.

Yes, do it.

> > > > 3 - Send me the messages printed by the driver at bootup using driver
> > > > verbose level 2. Boot command line : ncr53c8xx=verb:2
> > >
> > > I will do this the next time I reboot the machine.
>
> Here is the data:
>

[ ... ]

Nothing syspicious. But I would suggest you to use a normal driver set-up,
if narrowing the features seem not to help.

> Then a failure follows:
>
> Feb 24 05:40:15 kernel: ncr53c8xx_abort: pid=29172 serial_number=29181 serial_number_at_timeout=29181
> Feb 24 05:40:15 kernel: ncr53c895-0: abort ccb=c009a820 (cancel)
> Feb 24 05:40:17 kernel: SCSI host 0 abort (pid 29172) timed out - resetting
> Feb 24 05:40:17 kernel: SCSI bus is being reset for host 0 channel 0.
> Feb 24 05:40:17 kernel: ncr53c8xx_reset: pid=29172 reset_flags=2 serial_number=29181 serial_number_at_timeout=29181
> Feb 24 05:40:17 kernel: ncr53c895-0: resetting, command processing suspended for 10 seconds
> Feb 24 05:40:17 kernel: ncr53c895-0: restart (scsi reset).
> Feb 24 05:40:17 kernel: ncr53c895-0: enabling clock multiplier
> Feb 24 05:40:26 kernel: ncr53c895-0: command processing resumed
> Feb 24 05:40:26 kernel: ncr53c895-0-<0,*>: asynchronous.
> Feb 24 05:40:26 kernel: ncr53c895-0-<4,*>: asynchronous.
>
>
> After this, I got a kernel panic. I am running 2.2.2 now, BTW.

With your set-up, the driver suspends command processing for 10 seconds.
This may perhaps confuse timeouts processing of the SCSI layer if a
command times out within this delay or may trigger some new bug. 2 seconds
is enough. Most O/Ses donnot use more than 3 seconds of such a delay after
a SCSI reset.

Regards,
Gérard.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/