Re: Crazy load average & unkillable processes

From: Nico Schottelius
Date: Thu Aug 28 2003 - 04:02:28 EST


Very interesting..
with the test4 I experiene the same/similar problems on my laptop..
all of sudden yesterday several programs died -> Out of Memory.
I ran
Xfree
dhcpcd
opera
several xterms (about 6)
qmail
named

first opera was Out of Memory, then died the whole X system with all
xterms and X beeing Out of Memory.

MemTotal: 385600 kB

which should be more than enough!

Nico


Ross Clarke [Thu, Aug 28, 2003 at 12:41:30AM +0200]:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Brandon wrote:
>
> |Hi Everyone,
> |
> | I'm having some bothersome problems with a couple servers of
> |mine. I'm hoping some of you have some advice on how to trouble shoot
> |this, because my little brain is running out of ideas.
> |
> |All the servers are running Redhat 7.3, 2.4.20-19smp kernels,
> |apache-1.3.27, and Soft Raid-1.
> |
> |Here is what is happening, all of a sudden the server load average
> |climbs real high. It climbs to 100+ within a few minutes, then
> |constantly grows after that. The last server that had this happen was
> |at 375 avg when I rebooted it, which always needs to be a hard reboot -
> |because the shutdown -r now command doesn't do anything.
> |
> |While this is happening, I can not run commands like 'ps fax', 'pstree',
> |'top', 'killall' etc without them hanging . Most other commands work. I
> |can SSH to the server no problem. If I do a 'ps ax' I can see a list of
> |processes, but it always hangs before displaying them all. I narrowed it
> |down to anything that needs a full process list hangs.
> |
> |I wrote a script that runs 'ls -la /proc/$P', and 'cat /proc/$P/cmdline'
> |on each process in /proc.
> |
> |What I found is the processes that hang ps and whatnot are all owned by
> |apache. The script hangs on the ls -la /proc/$P whenever it hits an
> |apache process. The processes it hangs on can not be killed with kill
> |-9. The number of apache owned processes was at 250, while on a regular
> |server it is only at 20 or so.
> |
> |Running sar -v shows the dentunusd grow huge at about the time of the
> |issues:
> |
> |04:30:00 PM dentunusd file-sz %file-sz inode-sz super-sz %super-sz
> |dquot-sz %dquot-sz rtsig-sz %rtsig-sz
> |05:30:00 PM 38823 25900 12.35 24755 0 0.00
> |0 0.00 7 0.68
> |05:40:00 PM 39757 25854 12.33 25054 0 0.00
> |0 0.00 7 0.68
> |05:50:00 PM 4294967057 23526 11.22 4303 0 0.00
> |0 0.00 18 1.76
> |
> |Also, the number of sockets grows by about 3X:
> |
> |4:30:00 PM totsck tcpsck udpsck rawsck ip-frag
> |04:40:00 PM 136 60 5 0 0
> |04:50:00 PM 112 35 5 0 0
> |05:00:00 PM 121 40 7 0 0
> |05:10:00 PM 126 44 5 0 0
> |05:20:00 PM 115 38 5 0 0
> |05:30:00 PM 119 36 8 0 0
> |05:40:00 PM 120 42 6 0 0
> |05:50:00 PM 526 236 5 0 1
> |06:00:00 PM 531 224 5 0 0
> |06:10:00 PM 535 224 5 0 0
> |
> |
> |That is just about all I have come up so far. If anyone has seen this,
> |or can recommend on what steps I should take next, I could certainly us
> |the advice.
> |
> |Thank you all
> |
> |Brandon Belshaw
> |
> |
> |
> |
> |
> |-
> |To unsubscribe from this list: send the line "unsubscribe linux-admin" in
> |the body of a message to majordomo@xxxxxxxxxxxxxxx
> |More majordomo info at http://vger.kernel.org/majordomo-info.html
> |
>
> I just had the same similiar problem twice with 2.6.0-test4, I also used
> to experience it on 2.4.18. I managed to get ps to list tho, before all
> commands stopped working, and I noticed many of the proccesses went into
> D and Z states. I beleive they were getting stuck in the I/O subsystem,
> my other filesystems were still responding since my XMMS didnt die till
> it hit an mp3 on my main filesystem, which was about 30 minutes after
> the problem started. Any currently open application was still working,
> until I tried to do anything that required I/O, then they died aswell.
> That last happened to me about 12 hours ago, and I had to recover my
> entire /home directory. I couldnt find out what cuased it, the first
> time it was MozillaFirebird that died first, the 2nd time it was vim.
> Also both times I tried hitting the power button to see if I could get
> any form of shutdown where the data would sync, both times the kernel
> OOPS'ed on the apmd event.
>
> Anybody got any ideas?
>
> Regards,
> Ross Clarke
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla Thunderbird - http://enigmail.mozdev.org
>
> iD8DBQE/TTOa1+7fkD/L8TgRAkmdAJ9ciSYT6tAQGT0Uk+RD7Y8gkbmEIwCffLIT
> z2SGntQl8+1sI1QRVFZtxho=
> =utNU
> -----END PGP SIGNATURE-----
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-admin" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
quote: there are two time a day you should do nothing: before 12 and after 12
(Nico Schottelius after writin' a very senseless email)
cmd: echo God bless America | sed 's/.*\(A.*\)$/Why \1?/'
pgp: new id: 0x8D0E27A4 | ftp.schottelius.org/pub/familiy/nico/pgp-key.new
url: http://nerd-hosting.net - domains for nerds (from a nerd)

Attachment: pgp00001.pgp
Description: PGP signature