Re: Lockup 2.1.6* => kmalloc/slab

Rik van Riel (H.H.vanRiel@fys.ruu.nl)
Wed, 12 Nov 1997 20:44:12 +0100 (MET)


On Wed, 12 Nov 1997, Rik van Riel wrote:
>
> Ahh, I'll lookup the EIP values... And, btw. it's running an

Well, shortly after posting my previous message, the system crashed
again. This time, I looked up the values and did some tracking in
the kernel source. First the EIP values etc...

EIP: 0010:[<c0195dfc>] EFLAGS: 00000207
ESI: 00000000 EDI: 00000000 EBP:c027f858
DS: 0018 ES: 0018
current task: RC5 cracking client
also runnable: kflushd, klogd.

Since kflushd runs SCHED_FIFO, the system must also have stopped
scheduling.
DS and ES are the same: does this mean something special?
The EIP value matches this function:

AscWaitTixISRDone (from drivers/scsi/advansys.c)

------------------------------------
from linux/drivers/scsi/advansys.c:

STATIC int
AscWaitTixISRDone(
    ASC_DVC_VAR asc_ptr_type * asc_dvc,
    uchar target_ix
)
{
    uchar cur_req;
    uchar tid_no;
    tid_no = ASC_TIX_TO_TID(target_ix);
    while (TRUE) {
        if ((cur_req = asc_dvc->cur_dvc_qng[tid_no]) == 0) {
            break;
        }
---------------------------
Possible culprit? Note the while (TRUE): if neither
exit condition is ever reached (because of faulty
interrupt handlers or something like that), this is
a good place for an endless loop...
---------------------------
        DvcSleepMilliSecond(1000L);
        if (asc_dvc->cur_dvc_qng[tid_no] == cur_req) {
            break;
        }
    }
    return (1);
}

/*
 * Delay for 'n' milliseconds. Don't use the 'jiffies'
 * global variable which is incremented once every 5 ms
 * from a timer interrupt, because this function may be
 * called when interrupts are disabled.
 */
STATIC void
DvcSleepMilliSecond(ulong n)
{
    ulong i;

    ASC_DBG1(4, "DvcSleepMilliSecond: %lu\n", n);
    for (i = 0; i < n; i++) {
        udelay(1000);
--------------
yes, this might be the culprit
--------------
    }
}

-------------------------
And from linux/asm/delay.h:
-------------------------
Well, we all know what happens here...
Included for completeness; maybe somebody can
find something to improve in the functions below.
-------------------------
extern __inline__ void __udelay(unsigned long usecs, unsigned long lps)
{
    usecs *= 0x000010c6; /* 2**32 / 1000000 */
    __asm__("mull %0"
        :"=d" (usecs)
        :"a" (usecs),"0" (lps)
        :"ax");

    __delay(usecs);
}

extern __inline__ void __delay(int loops)
{
    __asm__ __volatile__(
        ".align 2,0x90\n1:\tdecl %0\n\tjns 1b"
        :/* no outputs */
        :"a" (loops)
        :"ax");
}
-------------------------------------------
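To make the arithmetic in __udelay() concrete, here is a rough user-space
model of it. This is only a sketch: udelay_loops() is a name made up for
illustration, and lps is assumed to be a loops-per-second calibration value,
as the parameter name suggests. The 50000000 used in the note below is a
hypothetical figure, not one measured on the crashing machine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the __udelay() arithmetic quoted above.
 * Returns the loop count that would be handed to __delay().
 */
static uint32_t udelay_loops(uint32_t usecs, uint32_t lps)
{
    uint32_t scaled;

    /* Guard for this model only: larger values overflow the 32-bit multiply. */
    assert(usecs <= 1000000u);

    /* 0x10c6 = 4294, i.e. 2**32 / 1000000 rounded down. */
    scaled = usecs * 0x000010c6u;

    /* mull leaves a 64-bit product in EDX:EAX; "=d" keeps the high half. */
    return (uint32_t)(((uint64_t)scaled * lps) >> 32);
}
```

With a hypothetical lps of 50000000, udelay_loops(1000, 50000000) comes out
as 49988 rather than the ideal 50000, because the constant is truncated.
DvcSleepMilliSecond(1000L) repeats that 1 ms spin 1000 times, so each pass
through the while (TRUE) loop busy-waits for about a full second.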
Of course, the actual bug may well be somewhere else. If
some interrupt pointers or some other crucial structure
get clobbered, the system may never reach one of the exit
conditions...
A simple (partial) fix for this driver may be to keep
track of all SCSI requests (isn't that already done by
the mid-level SCSI code?) and simply reset the bus after
waiting some (3) seconds... Or even after half a second,
since system freezes of several seconds might be lethal to
other parts of the system (maybe some networking stuff...)
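
The bounded wait suggested above could look roughly like the sketch below.
It is a user-space model, not a patch: struct dvc_model, model_sleep_ms(),
wait_isr_done_bounded(), MAX_WAIT_MS and POLL_MS are all invented for
illustration, and the bus reset is only counted, not performed.

```c
#include <assert.h>

#define MAX_WAIT_MS 3000UL  /* hypothetical budget: the "(3) seconds" above */
#define POLL_MS      100UL  /* hypothetical poll interval */
#define MAX_TID         8   /* hypothetical number of targets in the model */

/* Hypothetical stand-in for the driver state plus a simulated clock. */
struct dvc_model {
    unsigned char cur_dvc_qng[MAX_TID]; /* outstanding requests per target */
    unsigned long elapsed_ms;           /* simulated time spent waiting */
    unsigned long done_every_ms;        /* one request completes per interval; 0 = never */
    int bus_resets;                     /* how often a reset was requested */
};

/* Simulated delay: advances the clock and completes requests on schedule. */
static void model_sleep_ms(struct dvc_model *dvc, unsigned tid_no,
                           unsigned long ms)
{
    dvc->elapsed_ms += ms;
    if (dvc->done_every_ms != 0 &&
        dvc->elapsed_ms % dvc->done_every_ms == 0 &&
        dvc->cur_dvc_qng[tid_no] > 0)
        dvc->cur_dvc_qng[tid_no]--;
}

/* Returns 1 if the queue drained, 0 if the budget ran out and a reset was requested. */
static int wait_isr_done_bounded(struct dvc_model *dvc, unsigned tid_no)
{
    unsigned long waited = 0;

    assert(tid_no < MAX_TID);
    while (dvc->cur_dvc_qng[tid_no] != 0) {
        if (waited >= MAX_WAIT_MS) {
            dvc->bus_resets++;  /* a real driver would reset the bus here */
            return 0;
        }
        model_sleep_ms(dvc, tid_no, POLL_MS);
        waited += POLL_MS;
    }
    return 1;
}
```

The point of the short poll interval is that no single delay is longer than
100 ms, and the total wait has a hard upper limit instead of depending on an
exit condition that may never come true.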

Rik.

----------
Send Linux memory-management wishes to me: I'm currently looking
for something to hack...