Re: Panic in scsi.c (and a fix)

Richard Waltham (dormouse@farsrobt.demon.co.uk)
Wed, 9 Sep 1998 02:49:05 +0100 (BST)


Ishikawa wrote:
>
> Kurt Garloff wrote:
> >
> > On Wed, Aug 19, 1998 at 01:52:03AM +0100, Richard Waltham wrote:
> > > Hi,
> > >
> > > I can generate the following panic in scsi.c at will using a CD media
> > > changer - Nakamichi MBR-7.
> > >
> > > Happens with kernel versions 2.0.35 and 2.0.36-pre6. I haven't checked any
> > > others.
> > >
> > > Attempt to allocate device channel 0, target 6, lun x
> > > Kernel Panic: No Device found in allocate_device().
> > >
> > > If I start the following two commands running in different vc's
> > >
> > > dd if=/dev/scdX of=/dev/null (X = 1, 2 ...)
> > >
> > > dd if=/dev/scdY of=/dev/null (Y = 0, 1 ...)
> > >
> > > and the second one started has Y < X I get the panic.
> > >
> > > eg
> > >
> > > dd if=/dev/scd1 of=/dev/null - starting this first
> > >
> > > dd if=/dev/scd0 of=/dev/null - then starting this
> > >
> > > generates the panic. Starting scd0 first and then scd1 is OK - but very
> > > sloooooow as it's spending most of the time changing CDs;)
> > >
> > > I guess the panic is caused by the call to allocate_device from
> > > do_sr_request in sr.c but don't know why.
> > >
> > > Anyone figure it out?
> >
> > There have been reports that the code in scsi.c doesn't correctly honour the
> > BLIST_SINGLELUN and that this causes problems with Nakamichi MBR-7(2) and
> > certain host adapter settings. (Probably other devices will be affected too.)
> >
> > Adding the NAKAMICHI to the blacklist (BLIST_SINGLELUN) and applying the
> > appended patch may help you. Chiaki Ishikawa reported successful operation
> > after creating the patch.
> >
> > If I correctly judge what I see, this was a bug in scsi.c. Alan, is this bug
> > still in 2.0.36? (I'm not sure if it's the correct fix, though. There have
> > been reports about missing locking in sr.c, too ...)
> >
>
> A couple of weeks ago or so, there was a mention of
> reproducible system panic when the following
> commands were issued against a Nakamichi MBR SCSI CD changer (The above
> exchange, that is.):
>
> dd if=/dev/scdx of=/dev/null &
> dd if=/dev/scdy of=/dev/null
>
> (x>y)
>
> I have the same (or similar) Nakamichi SCSI
> CD changer and I found that I can reproduce the
> same problem on my PC. The MBR7 SCSI CD changer is connected to Tekram DC390
> SCSI host adaptor card. The DC390 driver version is 1.20s2.
>

That was me I guess. My SCSI controller is a Symbios SYM8751SP using the
ncr53c8xx driver (and a new experimental version sym53x8xx v0.4) but both
give the exact same results.

> [By the way, in my case, the system got hung even when x< y. It could
> be due to the fact my system had this "single_lun" problem patch
> applied. But I don't know the real reason.]
>

No, this behaviour has nothing to do with the single_lun patch. But you will
need the patch to get things working along with an additional patch I'll
append at the end of this message.

I've looked into this. Adding a printk in the routine allocate_device in
scsi.c displays the devices as the drive table is searched. On my system the
high luns are scanned first going to the lowest lun last. This also ties in
with the failure I was getting where starting a high lun first and then a
lower lun caused a panic.

Other systems/drivers may order the devices the other way round, lowest luns
first then scanning through to the higher luns. This would then give a
failure when starting a lower lun before a higher lun.

The following patch to scsi.c will show the order devices are scanned if
you're interested. It is not part of the fix.

--- linux-2.0.36-pre8/drivers/scsi/scsi.c~ Wed Sep 9 01:10:19 1998
+++ linux/drivers/scsi/scsi.c Wed Sep 9 01:12:24 1998
@@ -1081,6 +1081,10 @@
target_busy = 0;
SCpnt = device->host->host_queue;
while(SCpnt){
+ printk("single_lun: (%d,%d,%d)\n",
+ device->channel,
+ device->id,
+ device->lun);
if(SCpnt->channel == device->channel
&& SCpnt->target == device->id) {
if (SCpnt->lun == device->lun) {

> I tried to see what is going on and could produce some
> printk() messages right before the panic occurs.
> This might be helpful in deducing the cause of the bug and so
> I am reporting this message.
>
> In the following session log I manually recorded,
> scd2 is the CD at id=6,lun=0, and
> scd3 is the CD at id=6,lun=1 if I am not mistaken.
>

Seems reasonable to me.

> The kernel I tested was 2.0.36pre7.
> (I obtained the patches from the site mentioned in Alan's message
> quoted in Linux Weekly News site.
> Can I simply run "patch -p1 < patch_for_pre7"
> instead of running patch_for_pre1, then for pre2, etc. in order to
> get the 2.0.36pre7 source tree?
> When I tried to apply patches in sequence, I got
> the dreaded "reverse patch detected" message and after looking
> at the files, I figured that
> each patch can be applied to the base 2.0.35 source tree in one operation to
> get to the pre-NNN status. Correct me if I am wrong here.)
>

The 2.0.36-pre patches apply to a clean 2.0.35. _Do not_ try adding a 2.0.36
pre-patch on top of another pre-patch.

8< some text removed

>
> I don't know if this message
> helps people in fixing the bug before 2.0.36 release, but
> this bug got to be fixed somehow in 2.0.3x release, I think.
>

It may be fixed - try the appended patch and let me know. I share your
concern.

> If someone wants to delve into this problem and would like me to
> print more info by inserting printk() in the source files,
> just let me know.

I have, and have what I believe is a fix, so don't need any more printk's -
besides there's no more room in my log files for any more messages as I've
filled them up with my own;)

>
> By the way, is the original reporter of this problem
> using DC390 or other SCSI cards?

Symbios SYM8751SP and SYM8951U.

> I don't think the DC390 driver is the cause of the problem, but
> just wanted to make sure that the problem occurs with the combination of
> Nakamichi SCSI CD changer and other SCSI cards.

Doesn't appear to be a driver problem though the failure depending on higher
or lower lun starting first appears driver dependent.

>
> Finally, here is the manual recording of the messages shown on the console
> after I typed the problematic commands and
> when the system panicked:
>

8< text removed (sorry)

>
> I don't know if the problem is caused by the improper protection of
> the various allocate routines in sr*.c files as mentioned in the
> case of 2.1.1xx kernel lately.

It doesn't appear to be caused by any problems in sr.c but I've only had it
running a couple of hours.

> But my understanding of SCSI subsystem of linux 2.0.35 is not good
> enough to make any judgement now, and for that matter, producing
> a patch for 2.0.35 based on the recent patch for protecting these functions
> based on SMP lock/unlock functions in 2.1.1xx is beyond me now.
>

Mine's not very good either but I do understand SCSI and the errors that were
introduced in the driver by this fault helped a lot. + lots of printk's.

> The output messages were produced by the insertion of printk in the
> following places in the relevant files:

I'll save these just in case

8< lots of code snipped

>
> Happy Hacking
>
> Chiaki Ishikawa
>
> PS: I am sorry that I only read linux-scsi mailing list...
>

Now for the patch which I think finally fixes these problems with at least
the Nakamichi drives. This is to the allocate_device routine in scsi.c

The panic was caused by the routine that scans the command queues for
devices: it finished the scan prematurely, before reaching the device being
allocated, which left the SCwait pointer set to NULL. That NULL pointer is
the reason for the panic.

The scan order appears to depend on the ordering done by the different
drivers so the failure could occur if a command with a high lun was started
first before a command with low lun or vice versa. Between us we appear to
cover both cases.

The fix forces the complete list to be scanned, saving status on whether any
of the relevant target luns are busy, and then sets the state of SCpnt
accordingly on completion. This guarantees that SCwait is set to a valid
value for the device being allocated.
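
To illustrate the difference between the two scans, here is a minimal
standalone model in C. It is only a sketch and not the kernel code: struct cmd,
struct dev, scan_old, scan_new and wait_is_lost are names made up for this
example, and the `active` flag stands in for a request status other than
inactive. The queue is ordered high lun first, as on my system.

```c
#include <stddef.h>

/* Simplified stand-in for a queued SCSI command (hypothetical, not Scsi_Cmnd). */
struct cmd {
    int channel, target, lun;
    int active;          /* nonzero models a command that is still in use */
    struct cmd *next;    /* next entry on the host queue */
};

/* Simplified stand-in for the device being allocated (hypothetical). */
struct dev { int channel, id, lun; };

/* Model of the old scan: stops at the first busy command on the target. */
static struct cmd *scan_old(struct cmd *queue, const struct dev *d,
                            struct cmd **wait)
{
    struct cmd *found = NULL, *p;

    *wait = NULL;
    for (p = queue; p != NULL; p = p->next) {
        if (p->channel != d->channel || p->target != d->id)
            continue;
        if (p->lun == d->lun) {
            *wait = p;
            if (found == NULL && !p->active)
                found = p;
        }
        if (p->active) {
            found = NULL;
            break;       /* early exit: *wait may still be NULL here */
        }
    }
    return found;
}

/* Model of the patched scan: always walks the whole queue. */
static struct cmd *scan_new(struct cmd *queue, const struct dev *d,
                            struct cmd **wait)
{
    struct cmd *found = NULL, *p;
    int target_busy = 0;

    *wait = NULL;
    for (p = queue; p != NULL; p = p->next) {
        if (p->channel != d->channel || p->target != d->id)
            continue;
        if (p->lun == d->lun) {
            *wait = p;
            if (found == NULL && !p->active)
                found = p;
        }
        if (p->active)
            target_busy = 1;   /* remember it, but keep scanning */
    }
    return target_busy ? NULL : found;
}

/* Returns 1 if the scan yields no command AND no wait pointer, i.e. the
 * state that leads to the panic. Queue order: lun 1 (busy), then lun 0. */
static int wait_is_lost(struct cmd *(*scan)(struct cmd *, const struct dev *,
                                            struct cmd **))
{
    struct cmd lun0 = { 0, 6, 0, 0, NULL };    /* idle, the one we want */
    struct cmd lun1 = { 0, 6, 1, 1, &lun0 };   /* busy, earlier in queue */
    struct dev want = { 0, 6, 0 };
    struct cmd *wait;
    struct cmd *got = scan(&lun1, &want, &wait);

    return got == NULL && wait == NULL;
}
```

In this model wait_is_lost(scan_old) reports the lost wait pointer, while
wait_is_lost(scan_new) does not: the target is still treated as busy, but the
wait pointer now refers to the command for the lun being allocated.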

This needs more testing so it can hopefully get into 2.0.36 - is this a
possibility Alan?

Check this out - it's against 2.0.36-pre8 but will probably apply equally
well to earlier 2.0.36-pre versions (pre7 anyway). Let me know how you get
on. If it's OK I'll pass it on to Alan if he's not read this and already done
something about it.

Now on to try and fix 2.1.xx :)

Bye for now
Richard

--- linux-2.0.36-pre8/drivers/scsi/scsi.c.orig Tue Sep 8 19:52:46 1998
+++ linux/drivers/scsi/scsi.c Wed Sep 9 00:33:22 1998
@@ -1045,6 +1045,7 @@
kdev_t dev;
struct request * req = NULL;
int tablesize;
+ int target_busy;
unsigned long flags;
struct buffer_head * bh, *bhp;
struct Scsi_Host * host;
@@ -1077,6 +1078,7 @@
SCpnt = SCpnt->device_next;
}
} else {
+ target_busy = 0;
SCpnt = device->host->host_queue;
while(SCpnt){
if(SCpnt->channel == device->channel
@@ -1095,13 +1097,15 @@
* outstanding command per device - this is what tends
* to trip up buggy firmware.
*/
- found = NULL;
- break;
+ target_busy = 1;
}
}
SCpnt = SCpnt->next;
}
- SCpnt = found;
+ if (target_busy)
+ SCpnt = NULL;
+ else
+ SCpnt = found;
}

save_flags(flags);

-- 
 Richard Waltham   |              Work: richard@digitalinterface.demon.co.uk
   At home in      |              Home:        dormouse@farsrobt.demon.co.uk
 Southampton UK    |                              100421.1276@compuserve.com
