kill -9 won't remove communicator, strace available!

Rob Hagopian (hagopiar@vuser.vu.union.edu)
Sun, 1 Feb 1998 23:08:55 -0500 (EST)


Ah! Finally, I captured an strace of a communicator process that died and
can't be removed with kill -9.

This is a PPro 200 (Tyan Titan Pro), 256M RAM, tulip ethernet,
Trident VGA, and NCR SCSI cards (all PCI).

RedHat 5.0 with all current updates (glibc 2.0.6), but this has been
happening for a long time now (since RedHat 4.2). Communicator 4.04 from
the RedHat RPM.

Kernel 2.0.33 w/ latest md driver (alpha, raid-5), plus the
TASK_UNINTERRUPTIBLE moves in filemap.c and buffer.c suggested a while ago
on the list (and by Jens).

The strace output is available at:
ftp://ftp.vu.union.edu/pub/users/hagopiar/commtrace
ftp://ftp.vu.union.edu/pub/users/hagopiar/commtrace.gz
ftp://ftp.vu.union.edu/pub/users/hagopiar/commtrace.bz2
http://www.vu.union.edu/~hagopiar/commtrace
http://www.vu.union.edu/~hagopiar/commtrace.gz
http://www.vu.union.edu/~hagopiar/commtrace.bz2

We currently have 5 of these dead processes ('ps auxl'):
 0 210 14801     1 0 0 13608  8208 wait_on_pag D ? 0:04 communica
 0 210 20018     1 0 0 19032 14616 wait_on_pag D ? 1:16 /usr/lib/
 0 210  5234  5195 0 0 16396 11284 wait_on_pag D ? 0:14 /usr/lib/
30 210 11901 11915 0 0 15432 10784 wait_on_pag D ? 0:22 /usr/lib/
30 210 17674 18506 0 0 19612 15260 wait_on_pag D ? 3:59 /usr/lib/
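
For anyone who wants to pull the stuck processes out of a listing like this automatically, here is a minimal Python sketch. It assumes the columns are in the usual long-format order (FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STAT TTY TIME COMMAND); that order is an assumption, since the header row is not shown above, and the function name stuck_processes is made up for illustration.

```python
# Rows copied from the 'ps auxl' listing above.
PS_ROWS = """\
0 210 14801 1 0 0 13608 8208 wait_on_pag D ? 0:04 communica
0 210 20018 1 0 0 19032 14616 wait_on_pag D ? 1:16 /usr/lib/
0 210 5234 5195 0 0 16396 11284 wait_on_pag D ? 0:14 /usr/lib/
30 210 11901 11915 0 0 15432 10784 wait_on_pag D ? 0:22 /usr/lib/
30 210 17674 18506 0 0 19612 15260 wait_on_pag D ? 3:59 /usr/lib/
"""


def stuck_processes(text):
    """Return (pid, wchan, command) for each row in uninterruptible sleep.

    Assumed column order (not confirmed by the post):
    FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STAT TTY TIME COMMAND
    """
    stuck = []
    for line in text.splitlines():
        # Split into at most 13 fields so a command containing spaces
        # stays in one piece.
        fields = line.split(None, 12)
        if len(fields) < 13:
            continue  # blank or truncated line
        pid, wchan, stat, command = fields[2], fields[8], fields[9], fields[12]
        if stat.startswith("D"):  # D = uninterruptible sleep
            stuck.append((int(pid), wchan, command))
    return stuck


for pid, wchan, command in stuck_processes(PS_ROWS):
    print(pid, wchan, command)
```

All five rows come back with the same wait channel, wait_on_pag (presumably a truncated wait_on_page), which is what makes a common cause look likely.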

I don't know what swap they'd even be using; /proc/meminfo reports:
         total:     used:     free:   shared:  buffers:   cached:
Mem:  261898240 256212992   5685248  85307392  59269120 105189376
Swap: 133885952    217088 133668864
MemTotal: 255760 kB
MemFree: 5552 kB
MemShared: 83308 kB
Buffers: 57880 kB
Cached: 102724 kB
SwapTotal: 130748 kB
SwapFree: 130536 kB
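
To put a number on how little swap is in use, here is a small Python sketch that parses the kB lines above and subtracts SwapFree from SwapTotal. The helper name parse_meminfo is hypothetical, and the sample text is just the figures from this message rather than a live read of /proc/meminfo.

```python
# The kB lines from the /proc/meminfo output above.
MEMINFO = """\
MemTotal: 255760 kB
MemFree: 5552 kB
MemShared: 83308 kB
Buffers: 57880 kB
Cached: 102724 kB
SwapTotal: 130748 kB
SwapFree: 130536 kB
"""


def parse_meminfo(text):
    """Map each 'Name: <number> kB' line to an integer number of kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name: 12345 kB"; skip anything else.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


info = parse_meminfo(MEMINFO)
swap_used_kb = info["SwapTotal"] - info["SwapFree"]
print("swap used:", swap_used_kb, "kB")  # swap used: 212 kB
```

The 212 kB result agrees with the 217088 bytes shown in the "used:" column of the Swap line, so swap is essentially idle. On a live machine the same function could be fed the contents of /proc/meminfo directly.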

So I presume it's waiting on a mapped file or some such? Is there any way
to pin down the page it's waiting for? I'm more than willing to run volumes
of tests on the machine (nonintrusive ones; this is a live machine)... This
is very frustrating; many thanks to anyone with suggestions!
-Rob H.