Re: sparc64 mt race?

Roger J. Allen (rja@sis.rpslmc.edu)
Fri, 30 Oct 1998 14:15:57 -0600 (CST)


On Thu, 29 Oct 1998, Rich Sahlender wrote:

> I'm not sure if this is a kernel, scsi, or 32 .vs. 64 bit issue...
>
> Using the latest 2.1.126 from vger cvs with ultrapenguin-1.0.9 on an
> Ultra1, a simple "mt -f /dev/st0 off" to unload a tape sends the cpu

Is the offline command your only problem, or do you also get errors with
other mt commands that access the drive?

> to 100%. Running processes continue but cannot terminate, new processes
> will not start. dmesg shows:
>
> esp0: IRQ 3,7e0 SCSI ID 7 Clock 40 MHz CCF=8 Time-Out 167 NCR53C9XF(espfast) detected
> esp0: FAST chip is fasHME (family=10, version=5)
> ESP: Total of 1 ESP hosts found, 1 actually in use.
> scsi0 : Sparc ESP366-HME
> scsi : 1 host.

My Ultra1 170E reports the same values for the esp, and I think that is
where the problem lies.

> Vendor: SEAGATE Model: ST32171W SUN2.1G Rev: 7462
> Type: Direct-Access ANSI SCSI revision: 02
> Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
> Vendor: EXABYTE Model: EXB-85008SQANXBA Rev: 07J0
> Type: Sequential-Access ANSI SCSI revision: 02
> Detected scsi tape st0 at scsi0, channel 0, id 4, lun 0

I may have the same problem, but my dmesg output does show error messages.
This is with an Exabyte 8505XLE attached to the EXTERNAL SCSI port,
which works fine with Solaris 2.6. Is your tape drive external or
internal? Maybe it does not matter.

> Vendor: TOSHIBA Model: XM5701TASUN12XCD Rev: 0997
> Type: CD-ROM ANSI SCSI revision: 02
> Detected scsi CD-ROM sr0 at scsi0, channel 0, id 6, lun 0
> scsi : detected 1 SCSI tape 1 SCSI cdrom 1 SCSI disk total.
> esp0: Disabling sync for buggy Toshiba CDROM.
                           ^^^^^
The Exabyte may have a similar bug (or bugs).

> esp0: Disabling DISCONNECT for target 6 lun 0
> esp0: target 6 asynchronous
> Uniform CDROM driver Revision: 2.14
> esp0: Disabling DISCONNECT for target 0 lun 0
> esp0: 16 byte WIDE transfers enabled for target 0.
> esp0: target 0 [period 100ns offset 15 20.00MHz FAST-WIDE SCSI-II]
> SCSI device sda: hdwr sector= 512 bytes. Sectors= 4157201 [2029 MB] [2.0 GB]
> sunhme.c:v1.2 10/Oct/96 David S. Miller (davem@caipfs.rutgers.edu)
> eth0: HAPPY MEAL (SBUS) 10/100baseT Ethernet 08:00:20:8d:21:cd
> eth0: Link is up using internal transceiver at 100Mb/s, Full Duplex.
>
> Nothing unusual there nor in syslog. Is there anything else I can
> provide to help?
>

Instead of "mt -f /dev/st0 off", try "mt -f /dev/nst0 status". When I
try to get the status, I get esp parity errors:

esp0: Disabling DISCONNECT for target 4 lun 0
esp0: data bad parity detected.
esp0: data bad parity detected.
esp0: yieee, bytes_sent < 0!
esp0: csz=0 fifocount=0 ecount=0
esp0: use_sg=0 ptr=0000000000500006 this_residual=0
esp0: Forcing async for target 4
esp0: got status only, esp0: bad parity somewhere mout= 5
esp0: still in msgout, parity error assumed
esp0: data bad parity detected.
esp0: data bad parity detected.
esp0: yieee, bytes_sent < 0!
esp0: csz=0 fifocount=0 ecount=0
esp0: use_sg=0 ptr=000000000050000c this_residual=0
esp0: Forcing async for target 4
esp0: got status only, esp0: bad parity somewhere mout= 5
esp0: still in msgout, parity error assumed

Someone mentioned checking the termination. Since the Ultra has a 68-pin
SCSI connector and the Exabyte has a 50-pin Centronics connector, I tried
an active terminator and a 68-to-50-pin adapter with high bit termination,
but that neither helped nor hurt.

One symptom that I noticed was the output from /proc/scsi/esp/0. Before
the tape drive is accessed, it shows:

Sparc ESP Host Adapter:
PROM node fffffffff0061044
PROM name SUNW,fas
ESP Model Happy Meal FAS
DMA Revision Rev HME/FAS
Live Targets [ 0 1 4 6 ]

Target #  config3   Sync Capabilities  Disconnect  Wide
0         000000e3  [5f,04]            no          yes
1         000000e3  [5f,04]            no          yes
4         000000a1  [5f,04]            no          no
6         000000a1  [5f,04]            no          no

But after running an "mt -f /dev/nst0 status" command, the esp logs the
parity errors shown above, and "cat /proc/scsi/esp/0" shows:

Sparc ESP Host Adapter:
PROM node fffffffff0061044
PROM name SUNW,fas
ESP Model Happy Meal FAS
DMA Revision Rev HME/FAS
Live Targets [ 0 1 4 6 ]

Target #  config3   Sync Capabilities  Disconnect  Wide
0         000000e3  [5f,04]            no          yes
1         000000e3  [5f,04]            no          yes
4         000000e1  [5f,04]            no          yes
6         000000a1  [5f,04]            no          no

This shows the tape drive on target 4 as a WIDE SCSI device instead of
NARROW!
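
The only change between the two listings is the config3 value for target 4
(000000a1 before, 000000e1 after). As a rough check, the sketch below XORs
the two values to isolate the bit that flipped. The helper names are made
up for illustration, and reading bit 0x40 as "wide enabled" is only an
inference from the tables above (every row with Wide = yes has it set and
every row with Wide = no has it clear), not something confirmed against
the chip documentation.

```c
#include <stdint.h>

/* Assumption: bit 0x40 of config3 tracks the "Wide" column.
 * Inferred from the /proc/scsi/esp/0 output only. */
#define ASSUMED_WIDE_BIT 0x40u

/* Return the bits that differ between two config3 snapshots. */
static uint32_t config3_changed_bits(uint32_t before, uint32_t after)
{
    return before ^ after;
}

/* Return 1 if the assumed wide bit is set in a config3 value. */
static int config3_looks_wide(uint32_t config3)
{
    return (config3 & ASSUMED_WIDE_BIT) != 0;
}
```

Calling config3_changed_bits(0xa1, 0xe1) leaves only bit 0x40, and
config3_looks_wide() agrees with the Wide column for each value listed
(a1, e1, e3).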

If I try to read from the drive, then the esp parity errors are constant
and nothing can be done except reboot.

In the kernel's drivers/scsi/esp.c, there is a section that decides
whether to check if a device is wide. Some code there claims that Toshiba
CD-ROMs are buggy, so they are never checked for wide (or synchronous)
support. I copied the Toshiba test, changed the copy to match Exabyte
tape drives as well, rebuilt the kernel, rebooted, and then I could read
from the tape drive without any errors!

Here is what I added (I wish I had swapped the order of the files for the
diff command; as it stands, the two lines I added are the ones marked
with "-"):

*** esp.c Thu Oct 29 18:17:11 1998
--- esp.c-dist Tue Sep 22 19:13:54 1998
***************
*** 1429,1436 ****
*/
if(esp->erev == fashme && !SDptr->wide) {
if(!SDptr->borken &&
- (SDptr->type != TYPE_TAPE ||
- strncmp(SDptr->vendor, "EXABYTE", 7)) &&
(SDptr->type != TYPE_ROM ||
strncmp(SDptr->vendor, "TOSHIBA", 7))) {
build_wide_nego_msg(esp, 16);
--- 1429,1434 ----
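
For readers who find the nested conditions hard to follow in diff form,
here is a minimal user-space sketch of the patched test. It is not the
driver code: struct fake_sdev and should_try_wide_nego() are hypothetical
stand-ins for the few SDptr fields the condition reads, and the
esp->erev == fashme check is left out. The TYPE_TAPE and TYPE_ROM values
are the standard SCSI peripheral device type codes.

```c
#include <string.h>

/* SCSI peripheral device type codes. */
#define TYPE_TAPE 0x01
#define TYPE_ROM  0x05

/* Hypothetical stand-in for the SDptr fields used by the test. */
struct fake_sdev {
    int type;            /* peripheral device type from INQUIRY */
    const char *vendor;  /* vendor string from INQUIRY */
    int borken;          /* device already marked as broken */
    int wide;            /* wide transfers already negotiated */
};

/* Return 1 if wide negotiation would be attempted, 0 if it is skipped. */
static int should_try_wide_nego(const struct fake_sdev *sd)
{
    if (sd->wide)
        return 0;
    /* strncmp() returns 0 on a match, so each clause is false only
     * when both the device type and the vendor prefix match. */
    return !sd->borken &&
           (sd->type != TYPE_TAPE ||
            strncmp(sd->vendor, "EXABYTE", 7) != 0) &&
           (sd->type != TYPE_ROM ||
            strncmp(sd->vendor, "TOSHIBA", 7) != 0);
}
```

With this shape, an Exabyte tape or a Toshiba CD-ROM is skipped, while a
tape drive from any other vendor still gets the wide negotiation message.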

Maybe the test should match only my model of Exabyte rather than all
Exabytes, as the code above does. Are there wide Exabyte tape drives that
this would break? Are all Toshiba CD-ROMs buggy, narrow, and asynchronous?

After changing the code in esp.c, I still get timeouts on the SCSI bus
when I rewind the tape, even after reading only a few hundred megabytes.
The read itself completes without any errors using:

cpio -itv -C 1024 -I /dev/nst0

esp0: Disabling DISCONNECT for target 4 lun 0
esp0: target 4 [period 200ns offset 11 5.00MHz synchronous SCSI]

Later, when it is rewound, these occur:

scsi : aborting command due to timeout : pid 9822, scsi0, channel 0, id 1, lun 0 Write (6) 00 40 38 02 00
esp0: Aborting command
esp0: dumping state
esp0: dma -- cond_reg<b2b70a10> addr<1e000000>
esp0: SW [sreg<17> sstep<04> ireg<20>]
esp0: HW reread [sreg<12> sstep<cb> ireg<00>]
esp0: current command [tgt<04> lun<00> pphase<UNISSUED> cphase<SLCTNORM>]
esp0: disconnected
SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 50000
scsidisk I/O error: dev 08:11, sector 16440
scsi : aborting command due to timeout : pid 9823, scsi0, channel 0, id 1, lun 0 Write (6) 02 40 30 02 00
esp0: Aborting command
esp0: dumping state
esp0: dma -- cond_reg<b2b70a10> addr<1e000000>
esp0: SW [sreg<17> sstep<04> ireg<20>]
esp0: HW reread [sreg<12> sstep<cb> ireg<00>]
esp0: current command [tgt<04> lun<00> pphase<UNISSUED> cphase<SLCTNORM>]
esp0: disconnected
SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 50000
scsidisk I/O error: dev 08:11, sector 147504
scsi : aborting command due to timeout : pid 9974, scsi0, channel 0, id 1, lun 0 Read (6) 02 40 30 02 00
esp0: Aborting command
esp0: dumping state
esp0: dma -- cond_reg<b2b70a10> addr<1e000000>
esp0: SW [sreg<17> sstep<04> ireg<20>]
esp0: HW reread [sreg<12> sstep<cb> ireg<00>]
esp0: current command [tgt<04> lun<00> pphase<UNISSUED> cphase<SLCTNORM>]
esp0: disconnected
SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 50000
scsidisk I/O error: dev 08:11, sector 147504

I also get timeouts with "mt -f /dev/st0 offline".

My wild guess is that the disabling of DISCONNECT for all devices on the
fashme/fasHME/espfast/ESP366-HME/NCR53C9XF (a few lines later in esp.c)
is causing the timeouts.

-- 
Roger J. Allen                Rush-Presbyterian-St. Luke's Medical Center
System Administrator                                   Chicago,  IL   USA
Surgical Information Systems                       Voice:  (312)-942-4825
Internet: rja@sis.rpslmc.edu                         FAX:  (312)-733-6921
