Re: Panic in scsi.c ( and a fix)

Ishikawa (ishikawa@yk.rim.or.jp)
Sat, 12 Sep 1998 08:59:44 +0900


Hello,

Richard Waltham wrote:
>

> I've looked in to this. Adding a printk in the routine allocate_device in
> scsi.c displays the devices as the drive table is searched. On my system the
> high luns are scanned first going to the lowest lun last. This also ties in
> with the failure I was getting where starting a high lun first and then a
> lower lun caused a panic.
>
> Other systems/drivers may order the devices the other way round, lowest luns
> first then scanning through to the higher luns. This would then give a
> failure when starting a lower lun before a higher lun.
>
Judging from the output of your printk() statement, the DC390 driver
also scans from higher lun to lower lun.

> The following patch to scsi.c will show the order devices are scanned if
> you're interested. It is not part of the fix.

[patch omitted.]

>
> The 2.0.36-pre patches apply to a clean 2.0.35. _Do not_ try adding a 2.0.36
> pre-patch on top of another pre-patch.
>

Thank you for the info. I have now tested your patch
on 2.0.36-pre8.

> It may be fixed - try the appended patch and let me know. I share your
> concern.
>
> > If someone wants to delve into this problem and would like me to
> > print more info by inserting printk() in the source files,
> > just let me know.
>
> I have, and have what I believe is a fix, so don't need any more printk's -
> besides there's no more room in my log files for any more messages as I've
> filled them up with my own;)
>
...
> > I don't think the DC390 driver is the cause of the problem, but
> > just wanted to make sure that the problem occurs with the combination of
> > Nakamichi SCSI CD changer and other SCSI cards.
>
> Doesn't appear to be a driver problem though the failure depending on higher
> or lower lun starting first appears driver dependent.
>

It now looks as though the DC390 card shows a slightly different problem.

>
> Now for the patch which I think finally fixes these problems with at least
> the Nakamichi drives. This is to the allocate_device routine in scsi.c
>
> The panic was being caused by the routine scanning the command queues for
> devices finishing the scan prematurely before the device being allocated had
> been scanned leaving the SCwait pointer set to NULL - the reason for
> the panic.
>
> The scan order appears to depend on the ordering done by the different
> drivers so the failure could occur if a command with a high lun was started
> first before a command with low lun or vice versa. Between us we appear to
> cover both cases.
>
> The fix forces the complete list to be scanned saving status on whether any
> of the relevant target luns are busy and then setting the state of SCpnt
> accordingly on completion. This guarantees that SCwait is set to a valid
> value for the device being allocated.
>
> This needs more testing so it can hopefully get into 2.0.36 - is this a
> possibility Alan?
>
> Check this out - it's against 2.0.36-pre8 but will probably apply equally
> well to earlier 2.0.36-pre versions (pre7 anyway). Let me know how you get
> on. If it's OK I'll pass it on to Alan if he's not read this and already done
> something about it.
>
> Now on to try and fix 2.1.xx :)
>
> Bye for now
> Richard
>

[patch omitted]
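
For readers without the patch at hand, here is a rough user-space sketch
of the idea as Richard describes it: scan the whole list, remember
whether any LUN on the target is busy, and only decide at the end. This
is NOT the actual patch and not the code in scsi.c. The struct cmd type
and the pick_slot() function are hypothetical names made up for
illustration; only the general approach comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued command; not the kernel's Scsi_Cmnd. */
struct cmd {
    int target;
    int lun;
    int busy;              /* nonzero while a request is outstanding */
    struct cmd *next;
};

/*
 * Scan the complete list instead of stopping at the first busy entry.
 * Returns a free slot for (target, lun), or NULL if any LUN on that
 * target is busy or no free slot exists. *wait_on is left pointing at a
 * slot for the requested device whenever the list contains one, so the
 * caller has something valid to wait on regardless of scan order.
 */
struct cmd *pick_slot(struct cmd *head, int target, int lun,
                      struct cmd **wait_on)
{
    struct cmd *c;
    struct cmd *found = NULL;
    int target_busy = 0;

    assert(wait_on != NULL);
    *wait_on = NULL;

    for (c = head; c != NULL; c = c->next) {
        if (c->target != target)
            continue;
        if (c->busy)
            target_busy = 1;       /* remember it, but keep scanning */
        if (c->lun == lun) {
            *wait_on = c;
            if (!c->busy && found == NULL)
                found = c;
        }
    }
    return target_busy ? NULL : found;
}
```

With a list ordered high LUN first, where LUN 3 is busy and LUN 2 is
requested, a scan that stopped at the busy LUN 3 entry would never reach
the LUN 2 entry. The full scan above still reports "busy" but leaves
wait_on pointing at the LUN 2 entry.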

I have now tested the patch on 2.0.36-pre8 with DC390 driver
version 1.20t. The Nakamichi CD changer is at SCSI ID=6 on the DC390
card's SCSI chain.
Ordinary operation seems to work OK.

However, when the test was done using the command sequence
mentioned earlier, namely,

dd if=/dev/scd2 of=/dev/null &
dd if=/dev/scd3 of=/dev/null

the system hung.
But this time around, I know why it seems to hang.
I had inserted the printk() to print the lun scanning in the
while loop as suggested.
The console screen was filled with the output of the printk() with
different LUNs (presumably counting down from the highest to the lowest
and then repeating the sequence again and again...)

It seems that either the while loop never exits(!!!),
or allocate_device() is called(?) again and again.

Nothing seems to get done within the kernel owing to this
seemingly never-ending loop!
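
One way to tell these two cases apart without flooding the console might
be to count instead of printing every LUN. The following is only a
user-space sketch under my own assumptions: struct scan_stats,
scan_enter() and scan_step() are hypothetical helpers, not anything that
exists in scsi.c, and inside the kernel the report would go through
printk() rather than being left to the caller.

```c
#include <stddef.h>

/* Hypothetical diagnostic counters; none of these names exist in scsi.c. */
struct scan_stats {
    unsigned long calls;           /* times the scan routine was entered */
    unsigned long iterations;      /* loop iterations in the current call */
    unsigned long max_iterations;  /* longest single call seen so far */
};

/*
 * Call once at the top of the scan routine.
 * Returns 1 on every `every`-th call, so the caller can print one
 * summary line now and then instead of one line per LUN.
 */
int scan_enter(struct scan_stats *s, unsigned long every)
{
    s->calls++;
    s->iterations = 0;
    return every != 0 && s->calls % every == 0;
}

/*
 * Call once per pass through the while loop.
 * Returns 1 each time the current call has run another `limit`
 * iterations, which flags a loop that is running suspiciously long.
 */
int scan_step(struct scan_stats *s, unsigned long limit)
{
    s->iterations++;
    if (s->iterations > s->max_iterations)
        s->max_iterations = s->iterations;
    return limit != 0 && s->iterations % limit == 0;
}
```

If calls stays fixed while iterations keeps growing, the loop itself
never exits. If calls keeps climbing while max_iterations stays close to
the number of entries in the list, the routine is being entered again
and again from above.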

One possibility is that the internal data structure got
corrupted; another is that the higher-level strategy is flawed.

[I am not sure if this is relevant, but I have TWO SCSI cards. The
DC390 driver is loaded as a module and is
given host adaptor No. 1. I think the other card, a Symbios NCR 8xx,
is given host adaptor No. 0. I mention this because the printk() showed
(0,6,n), where n is the varying LUN. I wonder why the first number is not 1(?)]

Anyway, there could be a more subtle cause of the bug
that depends on the timing of the
adaptor card, the CD changer mechanism, etc...

Again, if someone is interested in finding the real cause, I would not
mind inserting printk() statements in key places, so suggestions will be
appreciated.

Otherwise, the patch didn't seem to affect ordinary operations adversely,
so it seems to be in the right direction.
By ordinary operations, I mean operations such
as running "ls -lR" on two mounted CDs with deep directory
trees. This has been my stock test of the
SCSI driver using the Nakamichi CD changer. This operation seems to
drive the hardware/software of the adaptor and the changer mechanism
(lots of mechanical noise) to its limit :-)

Oh, one more thing I noticed (this could be an artifact of running
tail -f syslog-output-file and tail -f klogd-output-file on the
console while su'ed to root): while running "ls -lR" for testing purposes,
I saw the same (0,6,n) line, with n unchanged, printed several times
in succession. That is, (0,6,5), (0,6,5), ... maybe 4 to 8 times, then
(0,6,4), ... 4 to 8 times, and so on for n=3, 2, ... (0,6,1), ...
The repetition could be longer. Since it scrolled off the screen
so fast, I couldn't count the lines reliably.
It seems a little strange that repeated scans of the
same LUN would be necessary. (Also, as the
printk() output went through the "from higher to lower LUN" cycle
again and again, the repetition seemed to get longer...)
Again, this could be an artifact
caused by the logging daemons.

Anyway, this is what I found so far regarding the problem.

I will try adding a few more printk() calls on my own over the weekend to
see if I can get a clue.

Happy Hacking

Chiaki Ishikawa
