Re: [PATCH] fix spurious OOM kills

From: Martin MOKREJŠ
Date: Sun Nov 21 2004 - 07:18:52 EST


Thomas Gleixner wrote:
On Sat, 2004-11-20 at 22:19 +0100, Martin MOKREJŠ wrote:

It should only kill RNAsubopt and bash and touch nothing else.

Yes, that's true, this patch has helped. Actually the other xterm got
closed, but that's because bash is the controlling application of it.
I believe that's expected.

I'd prefer to get only RNAsubopt killed. ;)


Ok. To kill only RNAsubopt it might be neccecary to refine the criteria
in the whom to kill selection.

Why can't the algorithm first find the asking for memory now.
When found, kernel should kill first it's children, wait some time,
then kill this process if still exists (it might exit itself when children
get closed).

You have said it's safer to kill that to send ENOMEM as happens
in 2.4, but I still don't undertand why kernel first doesn't send
ENOMEM, and only if that doesn't help it can start after those 5 seconds
OOM killer, and try to kill the very same application.
I don't get the idea why to kill immediately.

As it has happened to me in the past, that random OOM selection has killed
sshd or init, I believe the algorithm should be improved to not to try
to kill these. First of all, sshd is well tested, so it will never
be source of memleaks. Second, if the algorithm would really insist on
killing either of these, I personally prefer it rather do clean reboot
than a system in a state without sshd. I have to get to the console.
Actually, it's several kilometers for me. :(



And still, there weren't
that many changes to memory management between 2.6.9-rc1 and 2.6.9-rc2. ;)
I can test those VM changes separately, if someone would provide me with
those changes split into 2 or 3 patchsets.


The oom trouble was definitly not invented there. The change between
2.6.9-rc1 and rc2 is justs triggering your special testcase.

Other testcases show the problems with earlier 2.6 kernels too.

It's a pitty no-one has time to at least figure out why those changes have
exposed this stupid random part of the algorithm. Before 2.6.9-rc2
OOM killer was also started in my tests, but it worked deterministically.
I wouldn't prefer extra algorithm to check what we kill now, I'd rather look
why we kill randomly since -rc2.

Of course your patch is still helpfull, sure.
Martin


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/