Re: Sporadic hang on 2.0.3[0,1,2,3,4pre2]

Manfred Petz (pm@radawana.cg.tuwien.ac.at)
Thu, 5 Mar 1998 09:40:18 +0100 (CET)


> thing that is really sticking in my mind right now is this. 2.0.29 didn't
> have these problems (at least, not reportedly in the net code). The net
> code in 2.0.30 - 33 looks like it should work without these problems, so I
> surmise something is happening behind the scenes. The networking is the
> only similarity between these crashes, etc. The net code makes heavy use of
> kmalloc/kfree for net buffers, etc. Other portions of kernel code make far
> lighter use of these functions (eg., the SCSI subsystem almost always

I had been running 2.0.33+tcpdebug for 5 days when it crashed (sorry, I
was in a hurry, so I just rebooted the machine without inspecting it further).

The machine locked up completely and filled the console with
messages like:

.... couldn't get a free skbuff ...
.... couldn't get a free page ...

There was no output from the debug-skbuff in the logs.

Again, I *really* don't think this is a hardware-related problem:
2.0.31 and previous versions had uptimes of more than a month and _never_
had a problem.

I've been running 2.0.33+tcpdebug again for 1 day now; when it stops again,
you'll hear from me with a detailed report. And I hope it locks up again
soon. But since I found nothing in the logs after the previous
crash, I'm not sure it will help...

What about this:
----------------

Let's assume that at least my problem here is related to a corrupted
skbuff list, caused by some other kernel code, maybe not even the
networking code. What about adding some kind of CRC to each skbuff head and
walking down the whole list in free_skb()/alloc_skb() (is it a list, and is
this possible?), and possibly at other frequently called places in the
kernel, bringing the system to an immediate halt if a CRC doesn't match and
displaying as much information as possible? Or is there a better place
to do this kind of checking?
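
To make the idea concrete, here is a rough user-space sketch in C. It is
only an illustration under assumptions: struct dbg_buf and the dbg_*
functions are hypothetical names, not the real struct sk_buff or any
existing kernel interface, and it uses a simple Adler-32-style checksum
in place of a true CRC. Instead of halting, the walker returns the first
bad head so that the caller can decide what to print before stopping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer head; NOT the real struct sk_buff. */
struct dbg_buf {
    struct dbg_buf *next;   /* singly linked here for brevity */
    void *data;
    size_t len;
    uint32_t check;         /* checksum over all fields above this one */
};

/* Adler-32-style checksum over the header bytes that precede 'check'. */
static uint32_t dbg_sum(const struct dbg_buf *b)
{
    const unsigned char *p = (const unsigned char *)b;
    size_t n = offsetof(struct dbg_buf, check);
    uint32_t a = 1, c = 0;

    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521u;
        c = (c + a) % 65521u;
    }
    return (c << 16) | a;
}

/* Call after every legitimate change to the header fields. */
static void dbg_seal(struct dbg_buf *b)
{
    assert(b != NULL);
    b->check = dbg_sum(b);
}

/*
 * Walk the list and return the first head whose checksum does not match,
 * or NULL if every head is intact. A head is verified before its 'next'
 * pointer is followed, so a trashed 'next' is caught at that head.
 */
static struct dbg_buf *dbg_verify_list(struct dbg_buf *head)
{
    for (struct dbg_buf *b = head; b != NULL; b = b->next) {
        if (b->check != dbg_sum(b)) {
            fprintf(stderr, "corrupt buffer head at %p (len=%lu)\n",
                    (void *)b, (unsigned long)b->len);
            return b;
        }
    }
    return NULL;
}
```

Note that walking the whole list on every allocation and free costs time
proportional to the list length, so a check like this would only be
suitable for a debugging kernel, not for normal use.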

Does this make any sense? I'd love to test it. :)

> ..3x kernels, also has the .33 hang problem on multiple machines. So, the
> next question is, how many people that have been having the
> hang/reboot/general blow up and die problems also had memory leaks under
> earlier 2.0.3x kernels?

No, at least not that I noticed. The machine I'm talking about is
idle most of the time, so maybe that is the reason.

pm
