corrupted wait_queue entries ( sock->wait )

Henner Eisen (eis@baty.hanse.de)
Thu, 21 Aug 1997 01:55:04 +0200


Hi,

With recent 2.1.x kernels I am encountering a bug where the system
locks up completely during a select() system call on socket fds. Usually,
there is no diagnostic output at all.

After some tricks to get at least a console oops message, it turned
out that there is a NULL pointer dereference inside free_wait(), which is
called at the end of do_select(). When free_wait() tries to remove wait queue
entries from the linked lists (inlined function remove_wait_queue()), it
accesses a NULL next field (which should never happen because the wait queue
lists are circular).
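
To illustrate why a NULL next field is fatal here, below is a minimal
user-space sketch of removal from a circular, singly linked list. It is not
the kernel's remove_wait_queue(); struct wq_entry, wq_init(), wq_add() and
wq_remove() are hypothetical names made up for this example. Removal has to
walk the ring to find the predecessor of the entry, so a single NULL link
anywhere on the ring gets dereferenced. The sketch checks for NULL and reports
an error where unchecked code would oops.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a wait queue entry (not kernel code). */
struct wq_entry {
    int id;
    struct wq_entry *next;   /* must always point at a ring member */
};

/* Turn a single entry into a ring of one. */
static void wq_init(struct wq_entry *e, int id)
{
    e->id = id;
    e->next = e;
}

/* Link e into the ring directly after pos. */
static void wq_add(struct wq_entry *pos, struct wq_entry *e)
{
    e->next = pos->next;
    pos->next = e;
}

/*
 * Unlink victim from its ring. Walks the ring until the predecessor of
 * victim is found. Returns 0 on success, or -1 if a NULL next pointer is
 * met on the way (the ring is corrupted). A ring that is corrupted into a
 * cycle not containing victim is not detected by this sketch.
 */
static int wq_remove(struct wq_entry *victim)
{
    struct wq_entry *prev = victim;

    for (;;) {
        struct wq_entry *nxt = prev->next;

        if (nxt == NULL)
            return -1;       /* unchecked code would dereference NULL here */
        if (nxt == victim)
            break;
        prev = nxt;
    }
    prev->next = victim->next;
    victim->next = victim;   /* leave victim as a ring of one */
    return 0;
}
```

With three entries a, b and c on one ring, removing b succeeds. Once c.next
has been overwritten with NULL, removing a fails at the point where the
unchecked variant would crash.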

I've inserted printk()s in sock_alloc() and sock_release() in order
to log their invocations and to report the address and value of the
sock->wait field. It turned out that a socket was released, but
free_wait() later tried to access the sock->wait field of this
released socket.

It seems that the struct socket storage (which is part of the inode) is
already being used for something else by the time free_wait() is called.
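
The lifetime problem can be modelled in user space as follows. This is only
a sketch under my own assumptions: struct fake_socket, struct wait_ref and all
functions below are hypothetical stand-ins, not kernel interfaces. A
select-style wait table saves the address of the socket's wait queue head when
it registers. If the socket is released in the meantime, that saved address
points into storage that no longer holds what it held at registration time.

```c
#include <stddef.h>

/* Hypothetical stand-in for a socket embedded in reusable storage. */
struct fake_socket {
    int   live;   /* 1 between fake_sock_alloc() and fake_sock_release() */
    void *wait;   /* models sock->wait, the wait queue head */
};

/* What a select-style wait table remembers about one queue it joined. */
struct wait_ref {
    struct fake_socket *owner;
    void **head;  /* address of owner->wait, saved at registration time */
};

static struct fake_socket slot;   /* one slot of storage, reused over time */

static struct fake_socket *fake_sock_alloc(void)
{
    slot.live = 1;
    slot.wait = NULL;
    return &slot;
}

/* Releasing the socket invalidates everything stored in the slot. */
static void fake_sock_release(struct fake_socket *s)
{
    s->live = 0;
    s->wait = NULL;
}

/* Put entry on the socket's queue and remember where the head lives. */
static struct wait_ref wait_register(struct fake_socket *s, void *entry)
{
    struct wait_ref r = { s, &s->wait };

    s->wait = entry;
    return r;
}

/*
 * Returns 1 if the saved reference still describes the queue that entry
 * was put on, 0 if the socket has been released (or its head changed).
 */
static int wait_ref_valid(const struct wait_ref *r, const void *entry)
{
    return r->owner->live && *r->head == entry;
}
```

A reference taken before fake_sock_release() is reported as stale afterwards.
Note that a flag like this only detects the problem in the model; a slot that
has been reallocated would look live again, so the actual remedy is presumably
to make sure the socket cannot go away while select still refers to it.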

Admittedly, I was using some experimental code (x25 sockets on top of isdn,
my own enhancements to the isdn network interfaces to allow for this, and the
application triggering the problem was the x25-based telnetd from the
x25-utility package). However, other total lockups may have similar
causes.

As oops messages are output using printk(), and printk() accesses wait queues
to wake up the log process (and in addition, klogd will communicate with
syslogd via sockets), the printout of the oops messages won't complete
successfully when the log wait queue is corrupted.

I've removed the wake_up_interruptible() call at the very end of printk().
This way, printk() won't trigger additional accesses to corrupted wait
queues. This leaves useful oops messages at least on the console. But kernel
messages are no longer written to disk, so no oops log is included.

If you experience similar total lockups without any diagnostic output,
you might also try removing the wake_up_..() call (and directing the console
log to your current vt using the klogconsole command).

Henner