Re: 2.0.33 oops, non-fatal, comments anyone? [EIP traced]

Chris Evans (chris@ferret.lmh.ox.ac.uk)
Thu, 26 Mar 1998 22:25:09 +0000 (GMT)


Hi,

Thanks Andrea for looking at this. The more I investigate, the more it
looks like a genuine kernel problem rather than naff hardware -- I'm
getting experienced at telling kernel problems from dodgy hardware, having
come across both :)

Perhaps some super-hackers could look at the original oops post?

Shortly after these two oopses, the machine actually died, with lots of

eth0: Couldn't allocate a sk_buff of size 60.
block on freelist at 00000000 isn't free

I see a trend in people reporting new 2.0.33 problems that result in
lock-ups plus scrolling eth0: messages like the above.

I was actually logged on to the machine when it died. I was exiting a
"zless something-or-other" and memory ran out instantaneously; /bin/clear
reported memory exhausted! And that was it, machine dead.

More commentary follows...

On Thu, 26 Mar 1998, Andrea Arcangeli wrote:

> On Wed, 25 Mar 1998, Chris Evans wrote:
>
> >Mar 25 02:56:55 ferret kernel: EIP: 0010:[schedule+384/652]
>
> The Oops is in goodness() (linux/kernel/sched.c) in the underlined line:

> if (p->policy != SCHED_OTHER)
> ^^^^^^^^^^^^^^^^^^^^^^^^
> The p->policy doesn't exist (the p pointer is corrupted). It shouldn't
> be NULL since the Oops says "general protection: 0000" and not NULL pointer
> dereference or something similar... Am I right here?

%edx was the register being deref'ed -- and it's trash, 0x64636364.
ASCII for "dccd"??
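
To double-check the ASCII guess, here is a minimal user-space sketch that
decodes a 32-bit register value byte by byte. The helper name reg_to_ascii
is made up for illustration; it is not anything from the kernel.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode a 32-bit register value into four
 * characters, most significant byte first (the order in which the hex
 * digits are printed in the oops). Non-printable bytes become '.'.
 * 'out' must have room for 5 bytes including the terminating NUL.
 */
static void reg_to_ascii(uint32_t value, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char byte = (value >> (24 - 8 * i)) & 0xff;
        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[4] = '\0';
}
```

Feeding it 0x64636364 gives "dccd". On little-endian x86 the bytes sit in
memory in the reverse order, but this value reads the same both ways.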

In the second oops, in exactly the same place, %edx is again being
deref'ed -- this time it is 0x9037ff39. Another, different, completely
corrupt task_struct pointer.

Maybe the run/wait queue changes in 2.0.33 are causing trashing??
Here's more surrounding context of the trashed pointer:

	p = init_task.next_run;

	......

	while (p != &init_task) {
		int weight = goodness(p, prev, this_cpu);
		             ^^^^^ -- me here

		if (weight > c)
			c = weight, next = p;
		p = p->next_run;
	}

So somewhere this linked list is getting corrupted. Nasty.
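
One way to catch this kind of damage earlier is to check that the forward
and backward links of the circular list still agree. The following is only
a user-space sketch: struct task is a made-up stand-in (assuming the run
queue is doubly linked through next_run/prev_run), and run_queue_is_sane
is a hypothetical name, not a kernel function.

```c
#include <stddef.h>

/* Hypothetical stand-in for the run-queue links; NOT the kernel's struct. */
struct task {
    struct task *next_run;
    struct task *prev_run;
};

/*
 * Walk the circular list from 'head'. Return 1 if every forward link is
 * matched by the neighbour's back link and the walk gets back to 'head'
 * within 'max_nodes' steps; return 0 otherwise.
 */
static int run_queue_is_sane(const struct task *head, size_t max_nodes)
{
    const struct task *p = head;

    for (size_t i = 0; i < max_nodes; i++) {
        const struct task *next = p->next_run;

        if (next == NULL || next->prev_run != p)
            return 0;   /* broken or mismatched link */
        if (next == head)
            return 1;   /* completed the circle */
        p = next;
    }
    return 0;           /* never got back to head */
}
```

This only catches stale or mislinked entries. A wild pointer such as
0x64636364 would still fault as soon as it is dereferenced, which is just
what the oops shows.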

> >Mar 25 02:56:56 ferret kernel: Call Trace: [do_select+133/484]
> >[do_select+397/484] [sys_select+387/596] [udp_rcv+956/976]
> >[ip_rcv+1091/1396] [old_select+63/80] [system_call+85/124]
>
> I will continue to trace the Oops tomorrow (or when I will have some
> time.....).

For now, I have dropped back to 2.0.32 + i_count fix. I am also using the
"normal" aic7xxx SCSI driver rather than 5.0.7, to keep the number of
variables down. Previously 2.0.32 had uptime >100 days; we'll see if I can
get similar stability again, which would hint that 2.0.33 introduced a
problem.

Cheers
Chris
