Re: Sporadious hang on 2.0.3[0,1,2,3,4pre2]

Doug Ledford (dledford@dialnet.net)
Wed, 04 Mar 1998 14:51:09 -0600


G. Sumner Hayes wrote:
>
> Chris Evans <chris@ferret.lmh.ox.ac.uk> wrote:
> >
> > On Wed, 4 Mar 1998, G. Sumner Hayes wrote:
> >
> > > I'm in the process of switching over to 2.1.x, but if I could get the
> > > debugging patches I'd love to help figure out what the problem in 2.0.x is.
> >
> > A few things have been suggested. Hopefully a 2.0.34pre3 will be out soon
> > addressing these. IP masq was one possible problem.
>
> Yes, I've been following the discussion. I don't use IP masq (and it's not
> built into my kernels), so it's probably not the culprit in my case. I
> also have turned off SKB_LARGE and PCI bridge optimizations; neither of those
> helped. gcc 2.7.2.3, 2.8.0, and egcs-1.0.1 all show this behavior. (I was
> hoping that fiddling with compilers might move code around just enough to
> hide the problem).
>
> If I could find the aforementioned debugging patches, I'd love to apply them
> and see if I can help sort things out; mysterious hangs with no oops or
> other information (shift-scrollock and other magic keys don't work) aren't
> much to go on, so I can't be helpful until I find some way to gather more
> information.

So far, out of all the .config files people have sent me, and all the
various ways people have tried testing different kernels, etc., I have not
found a single common thread amongst all of the configurations. The only
thing that is really sticking in my mind right now is this: 2.0.29 didn't
have these problems (at least, not reportedly in the net code). The net
code in 2.0.30 through 2.0.33 looks like it should work without these
problems, so I surmise something is happening behind the scenes. The networking is the
only similarity between these crashes, etc. The net code makes heavy use of
kmalloc/kfree for net buffers, etc. Other portions of kernel code make far
lighter use of these functions (e.g., the SCSI subsystem almost always
pre-allocates all of its memory at bootup and doesn't make any
kmalloc/kfree calls after the system is up and running). There was a memory
leak fixed in 2.0.31, which if I recall correctly only showed up because of
another bug fix in 2.0.30. There was another memory leak fixed in 2.0.32 or
2.0.33. My theory is that one or both of these memory leak fixes missed
some bizarre case in which the memory *shouldn't* have been freed. The
result of such premature freeing of memory is that some other piece of code
still has a pointer referencing the memory. Then the net code, due to its
heavy usage of kmalloc(), gets allocated that memory and starts to set it up
as an skbuff or rtbuf or some other item. Then the item that has a pointer
to freed memory wakes up, writes to the skbuff/whatever and trashes part
of the newly allocated and initialized buffer, resulting in skbuff/whatever
corruption that grinds the machine to a halt. This would explain the net
code getting hit with this generic bug when other things don't get hit as
badly, although we have the occasional, but much rarer, report of fs
problems as well that could be inode allocation related. It would also
explain the rough numbers of people having this problem. Most notably,
Daniel Ryde, who had some of the most consistent memory leaks in the early
2.0.3x kernels, also has the 2.0.33 hang problem on multiple machines. So, the
next question is: how many of the people who have been having the
hang/reboot/general blow up and die problems also had memory leaks under
earlier 2.0.3x kernels?
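
To make the suspected sequence concrete, here is a minimal user-space sketch
of it. Nothing in it is kernel code: pool_alloc() and pool_free() are
hypothetical stand-ins for kmalloc()/kfree(), the "net buffer" is just two
words (a magic value and a length), and the layout is invented purely for
illustration. It only assumes an allocator that hands recently freed memory
straight back out.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4
#define SLOT_WORDS 8

/* Toy allocator (hypothetical, stands in for kmalloc/kfree):
 * a fixed pool where the first free slot is always handed out. */
static unsigned int pool[POOL_SLOTS][SLOT_WORDS];
static int in_use[POOL_SLOTS];

static unsigned int *pool_alloc(void)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;                 /* pool exhausted */
}

static void pool_free(unsigned int *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p == pool[i])
            in_use[i] = 0;
    }
}

/* Invented layout of the toy "net buffer" header. */
enum { BUF_MAGIC = 0, BUF_LEN = 1 };
#define MAGIC_VALUE 0x534B4221u

/* Returns 1 if the buffer header was corrupted by the stale writer. */
int demo_stale_write(void)
{
    /* 1. Some other code allocates an object and keeps a pointer to it. */
    unsigned int *owner = pool_alloc();
    assert(owner != NULL);
    owner[0] = 1;                /* e.g. a "busy" flag */

    /* 2. The object is freed too early; the owner's pointer survives. */
    pool_free(owner);

    /* 3. The allocation-heavy code is handed the same memory and
     *    initializes it as a buffer header. */
    unsigned int *buf = pool_alloc();
    assert(buf != NULL);
    buf[BUF_MAGIC] = MAGIC_VALUE;
    buf[BUF_LEN] = 1500;

    /* 4. The stale owner wakes up and writes through its old pointer. */
    owner[0] = 0;

    /* 5. The freshly initialized header no longer checks out. */
    int corrupted = (buf[BUF_MAGIC] != MAGIC_VALUE);
    pool_free(buf);
    return corrupted;
}
```

In this toy, demo_stale_write() returns 1 because the stale write lands on
the word the new buffer uses for its magic value. The code doing the most
allocation is the most likely to inherit the bad memory, which is consistent
with the net code taking most of the damage.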

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
