On Thu, 13 Jun 2002, Federico Sevilla III wrote:
> I was able to log on to the server with enough time to SIGKILL the
> xfs-daemon process. Unfortunately this wasn't good enough. The server
> started running up various processes stuck in "D" state, the OOM killer
> went on panic mode and started killing things left and right, mostly from
> what I notice apache and apache-ssl processes with messages like "Out of
> Memory: Killed process xxxx (apache)". I was also able to do a `free`
> after killing xfs-daemon and noticed that there was a lot of free memory
> both physically and on swap.
>
> Within less than ten minutes on this relatively lightly-loaded server, I
> could not log in to the machine locally, even as root (whose home
> directory is not NFS-exported) and load levels shot up to around 25, which
> is definitely abnormal. Existing logged-on processes also got stuck in
> whatever they were doing (`ps ax`, in particular is what I remember).
It has always puzzled me that a process using lots of memory can bring
down an entire (otherwise relatively idle) server to the extent that one
cannot even get mingetty on a local terminal to respond to keypresses. I
can confirm that the described situation is not just a one-off.
It is my experience that a single process using large amounts of memory
causes the system to start swapping. Once it starts swapping, every
process that does anything (apart from indefinite wait) goes into "I'm
ready to do some processing, but my memory is swapped out" state, and the
whole system collapses.
I don't know if Linux prioritises swap-ins as well as CPU time, but
perhaps this would be a good way of stabilising this circumstance. One
could arrange the dynamic process priorities so that the bad neighbour
(read: exploited xfs-daemon) gets less than its fair share of physical
RAM, therefore letting the rest of the system live. Perhaps one could
limit the number of processes that are having their swap-in serviced at
any one time, and make the ones that are being serviced the ones with the
highest dynamic priority.
I don't know what exactly is causing the problem, but it is possible that
pages are being paged in, the process runs on them for a quantum, and then
paged out. In this case, it might help to dictate a minimum time a page
can be "in", measured in the amount of work the owning process manages to
do.
My other quibble is that the out-of-memory killer seems to choose
processes fairly randomly. Ideally, it should kill the process that is
causing the problem - which in my experience is always the biggest process
in the system, or the process which has grown the fastest recently.
However, it should probably be positively discouraged from killing things
like apache (where most of the process space is shared), or gettys (which
are firstly small, and secondly likely to be respawned by someone at the
first opportunity).
> Attempted reboots locally via Ctrl-Alt-Delete and Magic SysRq failed
> because (1) I had disabled ctrl-alt-delete mapping in inittab "for
> security", and (2) I had not compiled Magic SysRq support on this
> particular server. More notes to self.
I think that if you had not disabled ctrl-alt-delete, it wouldn't have
done much good anyway, since usually that key combination just tells init
to spawn a shutdown process, which has just as much chance of getting
resources as any other process in the system.
> While I agree with BugTraq posts in response to this that applications
> like Mozilla which accept font-sizes from unknown sources should have some
> check to prevent such large sizes from crashing X and/or the X Font
> Server, I'm alarmed by (1) the way the X font server allows itself to be
> crashed like this, and (2) the way the entire Linux kernel seems to have
> been unable to handle the situation.
I am in complete agreement with both points, but particularly that the
kernel should be able to cope with the situation gracefully. I know one
can set limits on processes, but the kernel should still be able to cope
if we don't.
> Suggestions on how to work around this on multiple levels would definitely
> be appreciated.
My suggestion would be to set a maximum core size for the xfs-daemon
process. Given your setup, something like 400MB seems appropriate - maybe
a little lower. Details for doing this seem to differ from linux to linux.
Having done that, I would make sure xfs respawns when it dies - that way
someone can't just DOS your whole network by asking for a large font.
Finally, wait for the xfs developers to put in a font size limit, or patch
the source yourself.
Apart from that, I wouldn't move xfs to a bigger server unless you have
already had people complaining about its performance. Moving it to a
bigger system just changes the constant in the equation - the attacker
would only need to specify a 100000 point font instead of a 50000 point
font to bring the system down.
I doubt any of my kernel suggestions have not already been thought about,
but it was worth a try. Please can this problem be fixed soon?
Matthew
-- "If I wanted to kill a battleship, I'd use a s?!tload of Harpoons." "NT is a lot cheaper." -- Paul Tomblin & Petro- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sat Jun 15 2002 - 22:00:29 EST