(A copy of this message has also been posted to the following newsgroups:
comp.os.linux.development.system)
[2.2.13+Solar-Designer's patches, libc5, AMD-K6-200, 64MB physical+64MB
swap, started off with Slack 4.0]
Well, I got into work and found that my very hard-working and faithful
Linux server, which runs my entire organisation, was showing a large
number of dead services.
At first I suspected the worst - HACKERS. klogd was dead, as were nmbd
[Netbios name services], my checkups monitor, powerd, rpc.mountd,
rpc.nfsd, and atalkd [AppleTalk protocol daemon for the netatalk file
server].
The system was still operational of course, but limping along a bit. I
restarted the dead services without incident, and immediately scrutinised
my logs. There were the usual hacker scans and probes of port 137 and
1080 - but no signs of intrusions, modified kernel, libraries, or system
binaries.
I was still very nervous - and went to my router logs, and didn't see
anything that would reveal signs of a successful break-in. No bizarre net
connections or funky network activity at all during the night, other than
the port 137 stuff.
[I have a set of default deny inbound firewall rules in force which block
just about everything except web, ftp, ssh, very limited telnet, a secure
POP3, and of course SMTP [patched sendmail 8.9.3 to remove the alias
rebuild DoS] ]
I was just calming down but still rather perplexed when I typed "dmesg"
just to see if the kernel barfed any messages about the process
closures... and I saw the final clue which unravelled the puzzle:
Out of memory for nmbd.
Out of memory for klogd.
Out of memory for atalkd.
Out of memory for checkups.
Out of memory for powerd.
Out of memory for rpc.mountd.
Out of memory for rpc.nfsd.
Out of memory for tiff2bin.
Aha! tiff2bin is a utility I wrote to rapidly print faxes to HP Laserjet
printers at extremely high speed by simply converting the tiff images to
200x200 dpi [line doubling the low quality faxes], and then
raster-compressing them using HP's FASST! algorithms. It works great,
however, the TIFF library is a bit of a bitch to deal with so I use the
"read entire file into core" method.
These in-core images can require over 8MB chunks of memory to convert.
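To make the "read entire file into core" method concrete, here is a minimal sketch in plain C. It is not the actual tiff2bin source and does not use the TIFF library; the helper name read_whole_file is hypothetical. It only shows the general pattern: one allocation sized to the whole input, which is why a large fax image translates directly into a large memory request.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: load a whole file into one malloc'd buffer.
 * Returns NULL on any failure; on success stores the size in *len_out
 * and the caller must free() the result. */
unsigned char *read_whole_file(const char *path, size_t *len_out)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file size by seeking to the end, then rewind. */
    if (fseek(fp, 0L, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0 || fseek(fp, 0L, SEEK_SET) != 0) { fclose(fp); return NULL; }

    /* One big allocation: this is where a multi-megabyte image hurts.
     * Ask for at least 1 byte so an empty file is not mistaken for failure. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    *len_out = got;
    return buf;
}
```

The alternative would be to process the image a strip or scanline at a time, which keeps the peak allocation small at the cost of more awkward code.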
My incoming fax logs confirm that there was indeed a fax [I
use HylaFax 4.1beta2] at the time of this meltdown.
I run squid 2.2 STABLE 5, and it was using nearly 18MB of memory at the
time. It had died with a core swap failure at about this time as well.
My box's "resting" memory usage is as follows [with 10MB used by squid]:
Memory:   Total    Used    Free  Shared  Buffers  Cached
Mem:      63312   60580    2732    9196    21048   20076
Swap:     65988   25004   40984
I don't read all of that to mean I was RAM challenged, however... with
all of this floating buffer/cache stuff, it can be hard to tell at times.
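One rough way to read those figures is to subtract buffers and cache from "used", on the assumption that the kernel can give most of that memory back under pressure. The sketch below does that arithmetic. The struct and function names are made up for illustration, the values are assumed to be in kilobytes, and treating all buffers and cache as reclaimable is an optimistic simplification.

```c
#include <stdio.h>

/* One "Mem:" row from free(1); values assumed to be in kilobytes. */
struct mem_row {
    long total, used, free_kb, shared, buffers, cached;
};

/* Memory held by programs themselves: "used" minus buffers and page cache. */
long used_by_programs_kb(const struct mem_row *m)
{
    return m->used - m->buffers - m->cached;
}

/* Rough upper bound on what could be handed out without swapping,
 * assuming buffers and cache can all be reclaimed (not always true). */
long roughly_available_kb(const struct mem_row *m)
{
    return m->free_kb + m->buffers + m->cached;
}

/* Print the two derived figures for a row. */
void print_summary(const struct mem_row *m)
{
    printf("used by programs:  %ld kB\n", used_by_programs_kb(m));
    printf("roughly available: %ld kB\n", roughly_available_kb(m));
}
```

Fed the "resting" row above (63312, 60580, 2732, 9196, 21048, 20076), this gives 19456 kB used by programs and 43856 kB roughly available, which is consistent with the box not looking starved for RAM when idle.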
Now, my $64,000 question:
Why didn't the system kill the process that was being the pig (tiff2bin)?
It seemed to kill processes "early" in the process table to free up memory
or something. powerd and checkups don't allocate memory once they've
started up, so they didn't die because they were attempting to allocate
memory and failed. And klogd! I've never seen klogd or syslogd ever die
like this.
It seems like the kernel just up and decided to clean house or something,
axing whichever processes seemed expedient.
Hell, I don't mind it killing off processes, but WHY NOT HAVE IT KILL
USER/NON-DAEMON processes [or at the very least, ones that are neither in
the wheel group nor running as UID=0]? This was a real pisser of a
"mini-crash".
Amazingly, most of the remaining services were OK despite no klogd etc.
I have a separate logfile for kernel alerts [which is solely used by the
Solar Designer no-stack-exec and other security log entries], and it's
been empty and devoid of problems for the box's entire 33 day uptime. So
I highly doubt it's a problem with his patches. [I do not have
auto-trampolines enabled, but I've yet to see a daemon or piece of code on
my box which requires them].
My kernel is gcc2.7.2.3 compiled, and it is conservatively configured
[very little experimental code if any, other than the Solar D stuff].
I am VERY worried that one (non-root) process requesting a large chunk of
memory could cause several vital system services to fail. I also find it
extremely worrisome that the memory hog was the last process to get the
chop. To me this is counter-intuitive. I'd think the kernel would nix
the "youngest" non-root process as being the most anti-social. Its killing
klogd was the biggest surprise of all.
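One possible mitigation for the "single hungry process" worry is a per-process resource limit, so that the hog's own allocation fails first. The sketch below uses the standard getrlimit()/setrlimit() calls with RLIMIT_AS. The helper name cap_address_space is hypothetical, and whether RLIMIT_AS is available and enforced on a 2.2 kernel with libc5 is an assumption that would need checking. It also does not change which process the kernel picks once memory really is exhausted.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: set this process's soft address-space limit,
 * leaving the hard limit unchanged. Requests above the hard limit are
 * clamped to it. Returns 0 on success, -1 on error. */
int cap_address_space(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    /* An unprivileged process may not raise the soft limit past the hard one. */
    if (bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_AS, &rl);
}
```

A program like tiff2bin could call something like this early on (the right figure would have to be chosen by experiment), after which an oversized malloc() should return NULL in that process, where it can be handled or logged. The same kind of cap can usually be applied from the shell that launches the program using its ulimit builtin.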
Is there a "memory fragmentation" issue with Linux 2.2 memory management
that can sometimes arise in pathological cases? It'd be extremely
difficult for me to replicate the conditions that led to this situation
on my system. It's seen heavy squid, Macintosh as well as Windows file
sharing, and the usual things a modern net-using org does to an Intranet
box.
Any ideas or explanations (or especially recommendations!) would be very
welcome,
=Rob=
-- The reply-to-address is real and will expire on 12:01AM 1-Feb-2000.
Spammers: You will lose your network access. Guaranteed. 102 domains,
376 web-accounts, and 568 dialup ISP accounts flushed.
This archive was generated by hypermail 2b29 : Sat Jan 15 2000 - 21:00:16 EST