Re: ncr53c8xx-2.6 feature freeze. Need testers.

Ion Badulescu (ionut@moisil.cs.columbia.edu)
Tue, 14 Apr 1998 09:51:17 -0400 (EDT)


Hi Gerard,

On Mon, 13 Apr 1998, Gerard Roudier wrote:

> > controllers, and a whole bunch of disks, some newer (seagate barracuda 4G)
> > and some older. I'm still fighting with it to bring it into a stable
>
> Better not to mix old drives and recent Fast-20 ones on the same SCSI bus,
> if it is possible.
> This shall work in theory, but old devices may behave good enough when
> sharing the Bus with Fast-10 devices, but be not good enough to share
> the bus with Fast-20 devices.

All the disks I have are narrow, non-ultra SCSI, so this should not be a
problem. My 875s are a little odd: the chip is capable of handling a wide
bus, but the controllers have only narrow interfaces on them.

Anyway, I took your advice and left only the Barracudas on the ncr
controllers, two on each bus. This has improved stability a lot: no more
total hangs or spontaneous reboots since. One of the drives is still
giving me problems, though.

> > etc, this happens every few minutes. What does ERROR (0:98) mean, is it
>
> SIST (SCSI status) = 0x98 means:
> - 0x80 : SCSI Phase Mismatch
> - 0x10 : Reselected
> - 0x08 : SCSI gross Error
>
> If this one does not indicate a SCSI problem, likely a BUS problem, I
> will switch to IDE. :)
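
The bit breakdown quoted above can be checked mechanically. What follows
is only a minimal sketch: the table holds just the three bits named in the
quoted reply, the function name decode_sist is made up for illustration,
and any other set bit is reported as unknown rather than guessed.

```python
# Bit meanings are only the three named in the quoted reply;
# they are not taken from the chip documentation.
SIST_BITS = {
    0x80: "SCSI phase mismatch",
    0x10: "reselected",
    0x08: "SCSI gross error",
}

def decode_sist(value):
    """Return the names of the set bits in an 8-bit status value, high bit first."""
    names = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            # Bits missing from the table are flagged, not interpreted.
            names.append(SIST_BITS.get(bit, "unknown bit 0x%02x" % bit))
    return names

print(decode_sist(0x98))
# ['SCSI phase mismatch', 'reselected', 'SCSI gross error']
```

Since 0x80 + 0x10 + 0x08 = 0x98, the three flags account for the whole
value in "ERROR (0:98)".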

You're right (as usual :-)), it appears that one of the older drives was
wedging the bus every once in a while. I first moved that drive to the
BusLogic, then removed it completely when it started causing problems
there too, and now this chain is happy.

> > something I should be worried about? Cabling is good, but I suspect that
> > termination is not up to par on this particular chain, I'll check on it
>
> This may explain that.

It turned out that termination was ok, too. I knew I had active
terminators on the external bus but I wasn't sure about the internal
termination. It was ok though, one drive was terminating by supplying
power to the bus.

Right now, since only the Barracudas are left, the chains are totally
external, so the termination is provided by the controllers on one end and
by active terminators on the other end.

> > ncr53c875-0:5: SIR 18, CCB done queue overflow
>
> That should mean that 12 SCSI commands did complete and the kernel didn't
> find time to invoke the driver interrupt routine.
> My thought is that Linux is trying to be as fast as a rabbit and sometimes
> is succeeding in being as stupid as this animal. :-)
> Sorry for the joke, I couldn't resist. Flames merited and accepted. ;)

Nah, we all can joke every once in a while about our preferred OS, right?
:)

Anyway, this problem is still present, together with the timeouts
generated by the same drive:

Apr 14 08:52:01 tornado kernel: SCSI host 1 abort (pid 22507810) timed out - resetting
Apr 14 08:52:01 tornado kernel: SCSI bus is being reset for host 1 channel 0.
ncr53c8xx_reset: pid=22507810 reset_flags=2 serial_number=22511133 serial_number_at_timeout=22511133
ncr53c875-0: resetting, command processing suspended for 2 seconds
ncr53c875-0: restart (scsi reset).
ncr53c875-0-<5,0>: extraneous data discarded.
ncr53c875-0: enabling clock multiplier
ncr53c875-0: copying script fragments into the on-board RAM ...
ncr53c875-0: command processing resumed
ncr53c875-0-<5,0>: FAST-10 SCSI 10.0 MB/s (100 ns, offset 15)
ncr53c875-0-<6,0>: FAST-10 SCSI 10.0 MB/s (100 ns, offset 15)
ncr53c875-0-<5,0>: ordered tag forced, umap/smap=a4705351/4000.
ncr53c875-0-<5,0>: phase change 2-7 10@00fbd234 resid=4.
ncr53c875-0-<5,0>: ordered tag forced, umap/smap=a6de1255/a4501251.
ncr53c875-0-<5,0>: phase change 2-7 10@00fbd430 resid=4.

It goes on and on, most of the time just "ordered tag forced" and "phase
change" messages, but also timeouts (although not as often). I can give
you timestamps if you need them, but the events don't seem to be related:
some of these appear in the log with nothing else happening in the 10
minutes before.
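
To see which target produces which kind of message, the log lines can be
tallied per device. This is a rough sketch only: the regular expression
is inferred from the excerpt above rather than from the driver source,
and the category names and the helpers classify and tally are
hypothetical.

```python
import re
from collections import Counter

# Matches per-target driver lines such as
#   ncr53c875-0-<5,0>: phase change 2-7 10@00fbd234 resid=4.
# Controller-wide lines ("ncr53c875-0: resetting, ...") are skipped on purpose.
LINE_RE = re.compile(r"(ncr53c\d+-\d+-<\d+,\d+>): (.*)")

def classify(message):
    """Map a message to a coarse event kind (illustrative categories)."""
    if message.startswith("ordered tag forced"):
        return "ordered tag forced"
    if message.startswith("phase change"):
        return "phase change"
    return "other"

def tally(lines):
    """Count events per (target, kind) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group(1), classify(m.group(2)))] += 1
    return counts

sample = [
    "ncr53c875-0: command processing resumed",
    "ncr53c875-0-<6,0>: FAST-10 SCSI 10.0 MB/s (100 ns, offset 15)",
    "ncr53c875-0-<5,0>: ordered tag forced, umap/smap=a4705351/4000.",
    "ncr53c875-0-<5,0>: phase change 2-7 10@00fbd234 resid=4.",
    "ncr53c875-0-<5,0>: ordered tag forced, umap/smap=a6de1255/a4501251.",
    "ncr53c875-0-<5,0>: phase change 2-7 10@00fbd430 resid=4.",
]

counts = tally(sample)
for (target, kind), n in sorted(counts.items()):
    print(target, kind, n)
```

Run over the full syslog, the same tally would show whether the messages
really cluster on 0-<5,0>.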

Only one thing is different about this drive (0-<5,0>): it has a slightly
lower revision number than the other three. Don't know if that's important
or not:

Vendor: SEAGATE Model: ST15150N Rev: 0019

versus

Vendor: SEAGATE Model: ST15150N Rev: 0022

> > ncr53c875-0-<5,0>: ordered tag forced, umap/smap=dffdc9b7/12910001.
> > ncr53c875-1-<5,0>: ordered tag forced, umap/smap=41e7021b/4167021b.
>
> This looks like the consequence of the done queue overflow, or the timeout
> used to detect commands starvation is perhaps too short.
> Anyway, these are not considered as problems by the driver, but just as a
> should_not_often_happen situations.
> The done queue will be increased to 24 entries in the next driver
> version.
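
The overflow described above can be pictured with a toy model. This is
not the driver's code: DoneQueue, complete, drain and run are invented
names, and whether the real driver reports the overflow on the 12th or
the 13th completion is not stated in this thread. The sketch assumes a
queue that holds 12 entries and overflows on the next one.

```python
from collections import deque

class DoneQueue:
    """Fixed-capacity queue of completed commands (illustrative model only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.overflows = 0

    def complete(self, tag):
        """Record a completion; count an overflow if no slot is free."""
        if len(self.entries) >= self.capacity:
            self.overflows += 1
            return False
        self.entries.append(tag)
        return True

    def drain(self):
        """Model the interrupt routine collecting every queued completion."""
        done = list(self.entries)
        self.entries.clear()
        return done

def run(capacity, burst):
    """Complete `burst` commands with no drain in between; return the overflow count."""
    queue = DoneQueue(capacity)
    for tag in range(burst):
        queue.complete(tag)
    return queue.overflows

print(run(12, 13), run(24, 13))  # prints "1 0"
```

Under these assumptions, a burst that overflows a 12-entry queue fits
comfortably in 24 entries, which is the point of the planned increase.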

I see. Well, I have provided them anyway in case you want to look more
closely at them. They appear every few minutes on this particular machine
- but then, the activity pattern is scary too, the disk lights are on
almost all the time.

Just looking at /proc/interrupts is enough: :-)
0: 8979918 timer
9: 8502398 + ncr53c8xx
10: 16614730 Digital DS21140 Tulip
11: 9978508 + ncr53c8xx
12: 1969441 3c509
15: 4981532 + BusLogic BT-946C
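
For comparing interrupt load per driver, a listing like the one above can
be summed by device name. This is a minimal sketch that assumes the
layout shown here ("irq: count, optional +, name"). Newer kernels print
more columns, so the pattern would need adjusting there. The "+" marker
is kept as an opaque flag and not interpreted, and the helper names
parse_interrupts and totals_by_device are made up for this example.

```python
import re
from collections import defaultdict

# Layout assumed from the listing above: "<irq>: <count> [+] <device name>".
ROW_RE = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\+\s+)?(.+?)\s*$")

def parse_interrupts(text):
    """Return a list of (irq, count, flagged, device) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2)),
                         m.group(3) is not None, m.group(4)))
    return rows

def totals_by_device(rows):
    """Sum interrupt counts over all IRQ lines that share a device name."""
    totals = defaultdict(int)
    for _irq, count, _flagged, device in rows:
        totals[device] += count
    return dict(totals)

listing = """\
0: 8979918 timer
9: 8502398 + ncr53c8xx
10: 16614730 Digital DS21140 Tulip
11: 9978508 + ncr53c8xx
12: 1969441 3c509
15: 4981532 + BusLogic BT-946C
"""

rows = parse_interrupts(listing)
totals = totals_by_device(rows)
for device, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print("%-24s %d" % (device, count))
```

With the numbers above, the two ncr53c8xx lines together come to
18480906 interrupts, more than the Tulip card's 16614730.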

> You can add these information to your next full success report. :-)

If not full success, definitely "some" success and a more optimistic
perspective. :-) At least the machine doesn't crash anymore (knock wood!)
and although the timeouts are annoying, I could live with them.

Thanks a lot for your help,

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
            than to open it and remove all doubt.
