Re: sym/ncr53c8xx phase error still present in 2.2.7

Rainer Clasen (bj@ncc.cicely.de)
Wed, 5 May 1999 23:34:18 +0200


Hi!

Gerard Roudier (groudier@club-internet.fr) wrote:

> On Tue, 4 May 1999, Rainer Clasen wrote:
> > Shaw Carruthers (shaw@shawc.demon.co.uk),
> > "sym/ncr53c8xx phase error still present in 2.2.7":
> > > I did my monthly save under 2.2.7 and hit the phase error again, which I
> > > have been getting since late 2.1.xxx. Random disk data is corrupted and
> > > this time it took me a few hours to get the system back up.
> >
> > I see similar problems with an NCR810A and an IBM DDRS 4.3G. I've removed
> > anything else from the bus, replaced all cables, switched to known good
> > terminators, disabled sync tarbsfer, queueing, disconnect. No luck. I tried
> > 2.0.35, 2.2.[67] - if available with the sym83c8xx driver, too.

ok, some aditional information:

Driver were loaded as module.

Shortest cable I used is roughly 60cm long, has 3 connectors and shipped
with an 2940U. Terminators were all active (didn't use the passive onboard
terminator of the NCR). Easiest Setup was:

active internal terminator
|
| 20cm
|
+NCR
|
| 40cm
|
+DDRS with termination turned on.

Did I get the simplest possible setup? Maybe I should have used an active
external HD-DSub terminator, but I had none. I tried a HD-Dsub to Centronics
Adapter with an external active Centronics Termonator, too.

This is only one out of a zillion SCSI setups I tested.

I used a Gigabyte 586HX2 board (Intel 430HX chipset) with a 200MMX Pentium
and a Gigabyte 6LX1 board (440LX chipset) and a 266MHz P2. Nothing was
ever overclocked, PCI bus runs at regular 33 Mhz.

The 200MMX has 96MB RAM with (real) Parity, ECC is enabled. The P2 has 64MB
PC66 SDRAMS.

> > Replacing either disk (with an elderly 1G Seagate) or controller (with an
> > Adaptec 2940) solves the problem.

I'm currently using the NCR with a 1,50 meter cable and

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: DORS-32160W Rev: WA6A
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 02 Lun: 00
Vendor: HP Model: 1.050 GB 3rd Rev: 0582
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: PLEXTOR Model: CD-ROM PX-20TS Rev: 1.02
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: TOSHIBA Model: CD-ROM XM-5401TA Rev: 3115
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
Vendor: TEAC Model: CD-R55S Rev: 1.0K
Type: CD-ROM ANSI SCSI revision: 02

and have no problems. The IBM DORS is a Wide devices with an adapter. The HP
disk is the above mentioned Seagate (relabeled) with SCA-80 connector and
appropriate adapter. I suppose none of this devices does ultra SCSI.

> > I tried to reproduce this with FreeBSD 2.2.6 and 3.0 - Problem disapeared.
> > It ran my stress test for 5 hours. Linux ran only between 5 minutes up to 2
> > hours stable.
> >
> > Somehow I don't think it's hardware-related.

To be more precisely, I think it isn't only hardware's fault. Maybe the
ultra DDRS behaves kind of strange, but the BSD-driver seems to be able to
cpmpensate it. It would be nice if the linux driver did this, too.

> You seem to ignore that most (all?) of pieces that are used to build a
> computer, hardware and software as well, are bogus, and that a system that
> seems to work is just fortunate enough to not have bad bugs being
> triggered.
>
> You can think as long as you want but your computer (as mine) is just an
> assembly of pieces of crap and nothing better. :-)

jepp - until I tested this with FreeBSD on the *same* system I definatly
thought I had hardware that is (too) defective. Maybe FreeBSD *does* special
workarounds, but why can't we, too?

> I base my replies on the technical informations that are send with the
> problem descriptions. I cannot waste all my time to ask informations. I
> have a couple of weeks ago replied to a problem of phase change and it
> seems that I just waste my time.

This inspired me to search linux-kernel archives, too. Damn why didn't I do
this earlier? There are much more useful replies than I found in linux-scsi
archives. See below for some quotes.

> If you want problems to have better chance to be solved, you must provide
> information on your system and not only just describe the symptoms.

The intention was to to supply Shaw with additional Information I had. As
mentioned above I have kind of solved the Problem (although I'm still
interested in really resolving it).

What information do you need exactly? Kernel? Controller? Disk? cabling?
Termination? Most of them were in my mail. I posted more info in a follow up
and in a mail to linux-scsi. I put everything else I could imagine might be
impurtant into this mail. Ofcourse I dont expect you to dig through any
list/archive to find them.

> You testings do not prove anything about the real cause of the problem.
> If bogus A makes problems because of a correct behaviour of B, changing
> the behaviour of B may prevent problems of bogus A from being triggered.

exactly. But if you don't know A you need to find it somehow. And doing
sensless things is still better than doing nothing.

On Sat, 3 Apr 1999 19:20:15 +0200 Gerard Roudier wrote:
> - PCI BUS problem due to broken chip-set of misconfiguration.
> - Memory problem that affects DMA from the PCI BUS.

In two different boards which dont show any problems otherwise? One of those
boxen usually runs 24/7 and is quiet busy. Yes, I know, it's still
possible, but isn't it rather improbable.

> - Kernel or driver bug outside the ncr driver that makes some garbaged
> command go to the SCSI device.

possible. But why isn't the aic7xx driver vulnerable? I doubt, I'm capable
finding the guilty part in this case.

> - vfat inconsistency problem that confused the kernel vfat code, or some bug
> in the kernel vfat code that has been triggered.

FAT is known to be broken ;-) (citing Alexander Viro), but I get phase
change errors with ext2, too.

On Sun, 22 Jun 1997, Hauke Johannknecht wrote:
> scsi-host is a no-probs-till-now-NCR-810 with
> 5 devices attached.
>
> ID 0 -- IBM-DCRS 4.5 GB (new in the system, maybe the troublestarter)

On Sat, 29 Aug 1998 18:41:03 +0200 Gerard Roudier wrote:
> > Controller is an Asus PCI-SC860, disk is an IBM DDRS 34560.
>
> have a DDRS-34560W Firmware revision S71D connected on a 53C875 with
> 3 other Disks on the same SCSI BUS. I donnot have had problems with
> this hard disk for the moment.

Shaw has a DDRS, too. This makes 3 DDRS and one DCRS. This needn't say
anything, but maybe the Ultra (?) IBM drives attrakt the trouble?

Thanks
Rainer

-- 
KeyID=58341901 fingerprint=A5 57 04 B3 69 88 A1 FB  78 1D B5 64 E0 BF 72 EB

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/