diald problem & __sleep_on() and remove_wait_queue() bug?

Andrej Presern (andrejp@luz.fe.uni-lj.si)
Thu, 22 Jan 1998 17:45:36 +0100


Hello!

I've been fiddling with the problem where diald and su get stuck
whenever diald (chat really) is redialing constantly (busy line).
According to wchan both programs sleep in unix_connect() and I managed
to further narrow the blocking point to the first call to
interruptible_sleep_on().

After looking at the code in kernel/sched.c I noticed that there seems
to be a sti() missing after the last __remove_wait_queue() in
__sleep_on() which also happens to be the case in remove_wait_queue() in
include/linux/sched.h. Is my observation correct? I've changed the
source to include the 'missing' sti()s, but it had no effect.

Further symptoms include a slowly increasing number of chat connect
scripts (ps ax | grep chat), which goes along with an equally slow
increase in the number of CONNECTING (state 02) connections in
/proc/net/unix (grep " 02 " /proc/net/unix), where the number of
CONNECTING connections is one less than the number of chat connect
scripts.
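
For anyone who wants to compare against their own machine, the two
counts can be collected with something like the sketch below. The
function names are made up for illustration, and the chat pattern
assumes that no unrelated command line contains the string "chat".

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of diald or chat.

# Count running chat processes. The [c] in the pattern stops grep from
# matching its own entry in the ps output.
count_chat() {
    ps ax | grep -c '[c]hat'
}

# Count sockets in the CONNECTING state (02). The optional argument is a
# file in /proc/net/unix format and defaults to the live file.
count_connecting() {
    grep -c " 02 " "${1:-/proc/net/unix}"
}
```

Calling both every minute or so while the line is busy, for example
with echo "$(count_chat) $(count_connecting)", should show whether the
two numbers climb together.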

Has anyone else experienced similar problems with diald or am I alone
here? There's another machine I can access, with a completely
different setup (other admin, other distribution, no connection
whatsoever to my setup), that shows the same symptoms. I'm running
2.0.33 and the other machine is running 2.0.25, so this looks like a
long-standing bug.

Andrej