Re: Communicator hung, kill -9 won't remove

Rob Hagopian (hagopiar@vuser.vu.union.edu)
Mon, 26 Jan 1998 16:19:13 -0500 (EST)


I did this (2.0.33 + md alpha patch + stuff below) and we got another hung
communicator today:

[root@vuser ~]$ ps aux 14801
USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND
stepanep 14801 0.0 3.2 13608 8208 ? D Jan 25 0:04 communicator
[root@vuser ~]$ kill -9 14801
[root@vuser ~]$ ps aul 14801
FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STA TTY TIME COMMAND
0 210 14801 1 0 0 13608 8208 wait_on_pag D ? 0:04 communica
[root@vuser ~]$
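
For anyone who wants to watch for these automatically, here is a rough sketch (not something we run here) that picks the D-state rows and their wait channel out of saved ps output. It assumes the long format shown above, with PID, WCHAN, COMMAND and a state column named STA or STAT; the function name uninterruptible and the SAMPLE text are just for illustration.

```python
def uninterruptible(ps_text):
    """Return (pid, wchan, command) for every row whose state starts with 'D'."""
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumption: the state column is headed "STA" (as above) or "STAT".
    stat_col = next(i for i, name in enumerate(header) if name in ("STA", "STAT"))
    pid_col = header.index("PID")
    wchan_col = header.index("WCHAN")
    cmd_col = header.index("COMMAND")
    hits = []
    for ln in lines[1:]:
        # Split into at most len(header) fields so a command containing
        # spaces stays in the final field.
        fields = ln.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # wrapped or truncated row, skip it
        if fields[stat_col].startswith("D"):
            hits.append((int(fields[pid_col]), fields[wchan_col], fields[cmd_col]))
    return hits


# Sample taken from the transcript above, with the header on one line.
SAMPLE = """\
FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STA TTY TIME COMMAND
0 210 14801 1 0 0 13608 8208 wait_on_pag D ? 0:04 communica
"""

print(uninterruptible(SAMPLE))  # [(14801, 'wait_on_pag', 'communica')]
```

This only reports where a process is sleeping; it says nothing about why the page never gets unlocked.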

Is there anything that can be done to get more info on what
wait_on_page is waiting for and why it's not being released? Everything
below still applies, except that we're down to 2 SCSI cards for the
moment... (<sigh>)
-Rob H.

On Tue, 30 Dec 1997, Jens Maurer wrote:

> Hi,
>
> recently, there was a discussion on linux-kernel regarding
> hangs with the md driver.
>
> 2.1.76 contains a (possible) fix for this, but it looks
> like it could be easily backported to 2.0.3x.
>
> Go to linux/mm/filemap.c:__wait_on_page() and move
> the "current->state = TASK_UNINTERRUPTIBLE" assignment
> just after the repeat: label, before the run_task_queue()
> call. Same in linux/fs/buffer.c:__wait_on_buffer().
> Yes, that's moving the assignment just one line up.
>
> Recompile your kernel (don't forget to install the new
> System.map).
>
> Jens.
>
>
> Rob Hagopian wrote:
> > Some may remember (probably not) that many weeks ago I had a user who was
> > complaining of hung communicator processes that kill -9 wouldn't remove...
> > Well, he had another, but, of course, this was the one time he wasn't
> > running it through strace (<sigh>). However, we do have a good System.map
> > now, and we're running 2.0.32, so here are some tidbits (note that ps -l
> > says it's in wait_on_page):
> >
> > [root@vuser ~]$ id
> > uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
> > [root@vuser ~]$ ps -auxww | grep communicator | grep -v grep
> > stepanep 14647 1.0 17.5 26260 22152 ? D 19:47 3:21 communicator
> > [root@vuser ~]$ kill -9 14647
> > [root@vuser ~]$ ps -auxww | grep communicator | grep -v grep
> > stepanep 14647 1.0 17.5 26260 22152 ? D 19:47 3:21 communicator
> > [root@vuser ~]$ strace -p14647
> > attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
> > [root@vuser ~]$ ps -auxwwl | grep communicator | grep -v grep
> > 10 210 14647 24977 0 0 26260 22152 wait_on_pag D ? 3:21 communicator
> > [root@vuser ~]$
> >
> > So the question is: what's going on here? This is:
> > Dual PPro (Tyan MBoard)
> > 128MB RAM 127.6875MB Swap
> > 3 NCR825 SCSI boards each controlling 2 Quantum ST drives in a RAID-5
> > 1 Western Digital 4.0G IDE drive (/, /boot, and 1 128MB swap partition)
> > DLink ethernet card (tulip driver)
> > Trident PCI video card
> >
> > Kernel 2.0.32 + latest MD driver
>
>