Re: sym/ncr53c8xx phase error still present in 2.2.7

Gerard Roudier (groudier@club-internet.fr)
Thu, 6 May 1999 22:49:53 +0200 (MET DST)


Hello,

Thanks for the informations.
Nothing seems wrong to me in your description.

By the way, what I forget to mention in my previous mails is that the
sym53c8xx driver, and the ncr53c8xx driver as well, handles (at least on
paper -> jump to dispatcher) the situation of a target breaking the
COMMAND PHASE by continuing normally with the new phase selected by the
target, so allowing the target to signal the error or to recover from it.

The phase change from COMMAND to STATUS is acceptable if the target
detects wrong data in a CDB. Hopefully, if nothing more severe occured
we can expect the target to return a CHECK CONDITION.

But the phase change from COMMAND TO MESSAGE IN is not acceptable,
because there is no relevant message a target can send in this situation.
Anyway, the driver allow the target to behave as it want.

In SCSI it is the target that decides the transfer phases and not the
initiator. It is also the target that decides to disconnect and not
the initiator. Once the initiator has selected the target or has
been selected by the target, it is the target that has the baton and
not the initiator.

On Wed, 5 May 1999, Rainer Clasen wrote:

> Hi!
>
> Gerard Roudier (groudier@club-internet.fr) wrote:
>
> > On Tue, 4 May 1999, Rainer Clasen wrote:
> > > Shaw Carruthers (shaw@shawc.demon.co.uk),
> > > "sym/ncr53c8xx phase error still present in 2.2.7":
> > > > I did my monthly save under 2.2.7 and hit the phase error again, which I
> > > > have been getting since late 2.1.xxx. Random disk data is corrupted and
> > > > this time it took me a few hours to get the system back up.
> > >
> > > I see similar problems with an NCR810A and an IBM DDRS 4.3G. I've removed
> > > anything else from the bus, replaced all cables, switched to known good
> > > terminators, disabled sync tarbsfer, queueing, disconnect. No luck. I tried
> > > 2.0.35, 2.2.[67] - if available with the sym83c8xx driver, too.
>
> ok, some aditional information:
>
> Driver were loaded as module.
>
> Shortest cable I used is roughly 60cm long, has 3 connectors and shipped
> with an 2940U. Terminators were all active (didn't use the passive onboard
> terminator of the NCR). Easiest Setup was:
>
> active internal terminator
> |
> | 20cm
> |
> +NCR
> |
> | 40cm
> |
> +DDRS with termination turned on.
>
> Did I get the simplest possible setup? Maybe I should have used an active
> external HD-DSub terminator, but I had none. I tried a HD-Dsub to Centronics
> Adapter with an external active Centronics Termonator, too.
>
> This is only one out of a zillion SCSI setups I tested.
>
> I used a Gigabyte 586HX2 board (Intel 430HX chipset) with a 200MMX Pentium
> and a Gigabyte 6LX1 board (440LX chipset) and a 266MHz P2. Nothing was
> ever overclocked, PCI bus runs at regular 33 Mhz.
>
> The 200MMX has 96MB RAM with (real) Parity, ECC is enabled. The P2 has 64MB
> PC66 SDRAMS.
>
> > > Replacing either disk (with an elderly 1G Seagate) or controller (with an
> > > Adaptec 2940) solves the problem.
>
> I'm currently using the NCR with a 1,50 meter cable and
>
> Attached devices:
> Host: scsi0 Channel: 00 Id: 00 Lun: 00
> Vendor: IBM Model: DORS-32160W Rev: WA6A
> Type: Direct-Access ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 02 Lun: 00
> Vendor: HP Model: 1.050 GB 3rd Rev: 0582
> Type: Direct-Access ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 04 Lun: 00
> Vendor: PLEXTOR Model: CD-ROM PX-20TS Rev: 1.02
> Type: CD-ROM ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 05 Lun: 00
> Vendor: TOSHIBA Model: CD-ROM XM-5401TA Rev: 3115
> Type: CD-ROM ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 06 Lun: 00
> Vendor: TEAC Model: CD-R55S Rev: 1.0K
> Type: CD-ROM ANSI SCSI revision: 02
>
> and have no problems. The IBM DORS is a Wide devices with an adapter. The HP
> disk is the above mentioned Seagate (relabeled) with SCA-80 connector and
> appropriate adapter. I suppose none of this devices does ultra SCSI.
>
> > > I tried to reproduce this with FreeBSD 2.2.6 and 3.0 - Problem disapeared.
> > > It ran my stress test for 5 hours. Linux ran only between 5 minutes up to 2
> > > hours stable.
> > >
> > > Somehow I don't think it's hardware-related.
>
> To be more precisely, I think it isn't only hardware's fault. Maybe the
> ultra DDRS behaves kind of strange, but the BSD-driver seems to be able to
> cpmpensate it. It would be nice if the linux driver did this, too.
>
> > You seem to ignore that most (all?) of pieces that are used to build a
> > computer, hardware and software as well, are bogus, and that a system that
> > seems to work is just fortunate enough to not have bad bugs being
> > triggered.
> >
> > You can think as long as you want but your computer (as mine) is just an
> > assembly of pieces of crap and nothing better. :-)
>
> jepp - until I tested this with FreeBSD on the *same* system I definatly
> thought I had hardware that is (too) defective. Maybe FreeBSD *does* special
> workarounds, but why can't we, too?

In order to work around something, we need some clue on the real problem.
Btw, if I say you that there is no special work-around in the FreeBSD
ncr driver that may address this problem, do you beleive me?
But if FreeBSD has black-listed some firmwares, this may appear as a
work-around.

> > I base my replies on the technical informations that are send with the
> > problem descriptions. I cannot waste all my time to ask informations. I
> > have a couple of weeks ago replied to a problem of phase change and it
> > seems that I just waste my time.
>
> This inspired me to search linux-kernel archives, too. Damn why didn't I do
> this earlier? There are much more useful replies than I found in linux-scsi
> archives. See below for some quotes.

This can help.

> > If you want problems to have better chance to be solved, you must provide
> > information on your system and not only just describe the symptoms.
>
> The intention was to to supply Shaw with additional Information I had. As
> mentioned above I have kind of solved the Problem (although I'm still
> interested in really resolving it).
>
> What information do you need exactly? Kernel? Controller? Disk? cabling?
> Termination? Most of them were in my mail. I posted more info in a follow up
> and in a mail to linux-scsi. I put everything else I could imagine might be
> impurtant into this mail. Ofcourse I dont expect you to dig through any
> list/archive to find them.

People often omit to indicate:

- The chipset and CPU
- The memory type (ECC or not)
- The kernel version and possible patches
- If they have ever overclocked
- If the system and devices are correctly cooled.
- Etc ...

When a problem occurs, it is some symptoms that we are seeing and _often_
the real cause is not obviously correlated with the symptoms that are
observed.

> > You testings do not prove anything about the real cause of the problem.
> > If bogus A makes problems because of a correct behaviour of B, changing
> > the behaviour of B may prevent problems of bogus A from being triggered.
>
> exactly. But if you don't know A you need to find it somehow. And doing
> sensless things is still better than doing nothing.

You must start from some hypothesis and then try to change things that may
confirm your hypothesis.
For example, if you suspect a firmware problem you may disable features,
one at a time, starting from the ones that are known to be often bogusly
handled. In my opinion, the right order is:

1 - Disable Write Caching
2 - Reduce the tag depth
3 - Disable tags
4 - Reduce sync transfer speed
5 - Disable sync transfer
6 - Disable Wide transfer

If you suspect a PCI problem, then you may:

1 - Disable PCI cache based transactions (for the chip)
2 - Decrease burst length
3 - Disable burst
4 - Remove any patch or option that claim to optimize the chipset.
5 - Avoid using graphics or IDE during your testing

> On Sat, 3 Apr 1999 19:20:15 +0200 Gerard Roudier wrote:
> > - PCI BUS problem due to broken chip-set of misconfiguration.
> > - Memory problem that affects DMA from the PCI BUS.
>
> In two different boards which dont show any problems otherwise? One of those
> boxen usually runs 24/7 and is quiet busy. Yes, I know, it's still
> possible, but isn't it rather improbable.

The informations send to the list, seems to confirm this hypothesis, but
at the time I wrote the above these informations haven't been made
available as you certainly know.

> > - Kernel or driver bug outside the ncr driver that makes some garbaged
> > command go to the SCSI device.
>
> possible.. But why isn't the aic7xx driver vulnerable? I doubt, I'm capable
> finding the guilty part in this case.

I am also not in position to find it. It is often hard to find a problem
we can reproduce and I have never got a single weird PHASE CHANGE problems
on any of my systems since I started with ncr chips (1995).

> > - vfat inconsistency problem that confused the kernel vfat code, or some bug
> > in the kernel vfat code that has been triggered.
>
> FAT is known to be broken ;-) (citing Alexander Viro), but I get phase
> change errors with ext2, too.
>
>
> On Sun, 22 Jun 1997, Hauke Johannknecht wrote:
> > scsi-host is a no-probs-till-now-NCR-810 with
> > 5 devices attached.
> >
> > ID 0 -- IBM-DCRS 4.5 GB (new in the system, maybe the troublestarter)

Possibly.

The more severe problem that can happen in SCSI is when more than 1 device
think it has been selected. Several targets tampering the SCSI controls
lines is highly broken and the more probable symptom, as seen from the
initiator, has every chance to be some weird phase sequence.

> On Sat, 29 Aug 1998 18:41:03 +0200 Gerard Roudier wrote:
> > > Controller is an Asus PCI-SC860, disk is an IBM DDRS 34560.
> >
> > have a DDRS-34560W Firmware revision S71D connected on a 53C875 with
> > 3 other Disks on the same SCSI BUS. I donnot have had problems with
> > this hard disk for the moment.
>
>
> Shaw has a DDRS, too. This makes 3 DDRS and one DCRS. This needn't say
> anything, but maybe the Ultra (?) IBM drives attrakt the trouble?

If a firmware update exists, it should be tried, in my opinion.
Disabling SCSI features, one at a time, as I mentionned above should also
be interesting.

Regards,
Gérard.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/