I've been fiddling with the problem where diald and su get stuck
whenever diald (chat really) is redialing constantly (busy line).
According to wchan, both programs sleep in unix_connect(), and I managed
to narrow the blocking point down further to the first call to
interruptible_sleep_on().
After looking at the code in kernel/sched.c I noticed that there seems
to be a sti() missing after the last __remove_wait_queue() in
__sleep_on() which also happens to be the case in remove_wait_queue() in
include/linux/sched.h. Is my observation correct? I've changed the
source to include the 'missing' sti()s, but that had no effect.
Further symptoms include a slowly increasing number of chat connect
scripts (ps ax | grep chat), accompanied by an equally slow increase in
the number of CONNECTING (state 02) connections in /proc/net/unix
(grep " 02 " /proc/net/unix). The number of CONNECTING connections
is one less than the number of chat connect scripts.
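
For anyone who wants to watch the two counts side by side, here is a
minimal sketch, assuming a POSIX sh and awk. The function names are made
up for illustration, and the optional file argument is only there so the
parsing can be tried on a saved copy of /proc/net/unix. Instead of
grepping for " 02 ", it locates the "St" column from the header line,
which I have not verified against every kernel version.

```shell
#!/bin/sh
# Hypothetical helpers, not part of diald or chat.

# Count unix sockets whose St column is 02 (CONNECTING).
# The column is located via the "St" label in the header line.
count_connecting() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "St") col = i; next }
         col && $col == "02" { n++ }
         END { print n + 0 }' "${1:-/proc/net/unix}"
}

# Count processes whose command line contains "chat".
# The [c] keeps grep from counting itself; "|| true" hides grep's
# non-zero exit status when the count is 0.
count_chat() {
    ps ax | grep -c "[c]hat" || true
}

# Print both counts on one line.
report() {
    printf 'chat: %s  CONNECTING: %s\n' "$(count_chat)" "$(count_connecting)"
}
```

Calling report every minute or so while the line is busy should show
whether the two numbers keep growing together.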
Has anyone else experienced similar problems with diald, or am I alone
here? There's another machine that I can access, with a completely
different setup (other admin, other distribution, no connection
whatsoever to my setup), that shows the same symptoms. I'm running
2.0.33 and the other machine is running 2.0.25, so this looks like a
long-standing bug.
Andrej