Re: [PATCH] Revert oom rewrite series

From: Bodo Eggert
Date: Mon Nov 15 2010 - 18:33:52 EST


On Sun, 14 Nov 2010, David Rientjes wrote:

> Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE
> appropriately isn't a bug report, it's the desired behavior. I eliminated
> all of the arbitrary heuristics in the old heuristic that we had to
> remove internally as well so that it is as predictable as possible and
> achieves the oom killer's sole goal: to kill the most memory-hogging task
> that is eligible to allow memory allocations in the current context to
> succeed.

> CAP_SYS_RESOURCE threads have full control over their oom killing priority
> by /proc/pid/oom_score_adj

, but unless they were written in the last few months, designed
specifically for Linux, and their author took the time to research each
external process invocation, they cannot be aware of this possibility
(a sketch of what each of them would have to do follows below).

Besides that, if each process is supposed to change the default, the default is wrong.

> and need no consideration in the heuristic by
> default since it otherwise allows for the probability that multiple tasks
> will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious
> amount of memory.

If it happens to use an egregious amount of memory, it SHOULD score
enough to get killed.
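
(For completeness, this is all the "full control" over the killing
priority amounts to; a minimal sketch, assuming the oom_score_adj
interface from the rewrite and a shell acting on its own /proc entry:)

    # Inspect the current score and adjustment of this very shell.
    cat /proc/self/oom_score
    cat /proc/self/oom_score_adj

    # Make it a much more attractive OOM victim (range is -1000..1000).
    echo 500 > /proc/self/oom_score_adj

    # Exempt it entirely; lowering the value again is what normally
    # requires CAP_SYS_RESOURCE.
    echo -1000 > /proc/self/oom_score_adj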

> > The problem is, DavidR's patches don't reflect real-world use cases at
> > all and break them. He can argue that userland is wrong, but such an
> > excuse doesn't solve the real-world issue; it makes no sense.

> As mentioned just a few minutes ago in another thread, there is no
> userspace breakage with the rewrite and you're only complaining here about
> the deprecation of /proc/pid/oom_adj for a period of two years. Until
> it's removed in 2012 or later, it maps to the linear scale that
> oom_score_adj uses rather than its old exponential scale that was
> unusable for prioritization because of (1) the extremely low resolution,
> and (2) the arbitrary heuristics that preceded it.
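
(For reference, the compatibility shim appears to map the old [-17, 15]
range linearly onto the new [-1000, 1000] one. A rough sketch based on my
reading of the patches, assuming the constants OOM_DISABLE=-17 and
OOM_SCORE_ADJ_MAX=1000:)

    # Rough model of the oom_adj -> oom_score_adj mapping during the
    # deprecation period; not the kernel code itself.
    oom_adj_to_score_adj() {
        if [ "$1" -eq -17 ]; then
            echo -1000                 # "disable" still means disable
        else
            echo $(( $1 * 1000 / 17 ))
        fi
    }

    oom_adj_to_score_adj 15   # -> 882
    oom_adj_to_score_adj 8    # -> 470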

1) The exponential scale did have a low resolution.

2) The heuristics were developed with a lot of brain power and a lot of
trial and error. You are going back to basics, and some people
are not convinced that this is better. I googled and did not
find a discussion about how and why the new score was designed
this way.

Looking at the output of:

    cd /proc; for a in [0-9]*; do
        echo `cat $a/oom_score` $a `perl -pe 's/\0.*$//' < $a/cmdline`;
    done | grep -v ^0 | sort -n | less

I'm not convinced either.

PS) Mapping an exponential value to a linear score is bad. E.g. an
oom_adj of 8 should make a 1-MB process as likely to be killed as
a 256-MB process with oom_adj=0.
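
(That is how I read the old behaviour: a positive oom_adj shifted the
badness points left by that many bits. A back-of-the-envelope sketch,
counting badness in MB for simplicity, not the actual kernel code:)

    # Old exponential semantics, assumed to be roughly "points <<= oom_adj".
    small_mb=1      # 1 MB process with oom_adj=8
    big_mb=256      # 256 MB process with oom_adj=0

    echo $(( small_mb << 8 ))   # -> 256
    echo $(( big_mb  << 0 ))    # -> 256, i.e. the same badness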

PS2) Because I saw this in your presentation PDF: (@udev-people)
The -17 oom_adj of udevd is wrong, since it will even prevent
the OOM killer from working correctly if udevd grows to 100 MB:

Its default OOM score is 13, while root's shell is at 190
and some KDE processes are at 200 000. It will not get killed
under normal circumstances.

If udevd grows enough to score 190 as well, it has a bug
that causes it to eat memory, and it needs to be killed. With
an oom_adj of -17, it will instead cause the system to fail.
Considering udevd's size, an adj of -1 or -2 should be enough on
embedded systems, while desktop systems should not need it at all.
If you are worried about udevd getting killed, protect it using
a wrapper.
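
(By a wrapper I mean nothing fancier than a supervisor that respawns
udevd if it ever does get killed, instead of hiding it from the OOM
killer. A minimal sketch; the udevd path and its foreground behaviour
are assumptions, adjust for your system:)

    #!/bin/sh
    # Hypothetical respawn wrapper instead of oom_adj=-17.
    while :; do
        /sbin/udevd        # assumed to stay in the foreground without flags
        echo "udevd exited (status $?), restarting" >&2
        sleep 1            # avoid a tight respawn loop
    done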