Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

From: David Rientjes
Date: Wed Aug 25 2010 - 06:25:38 EST


On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:

> Current oom_score_adj is completely broken because it is strongly bound
> to the Google use case and ignores all others.
>

That's wrong, we don't even use this heuristic yet and there is nothing,
in any way, that is specific to Google.

> 1) Priority inversion
> As kamezawa-san pointed out, this breaks cgroup and lxr environment.
> He said,
> > Assume 2 processes A, B which have oom_score_adj of 300 and 0
> > And A uses 200M, B uses 1G of memory under 4G system
> >
> > Under the system.
> > A's score = (200M * 1000)/4G + 300 = 350
> > B's score = (1G * 1000)/4G = 250.
> >
> > In the cpuset, it has 2G of memory.
> > A's score = (200M * 1000)/2G + 300 = 400
> > B's score = (1G * 1000)/2G = 500
> >
> > This priority inversion doesn't happen in the current system.
>

You continually bring this up, and I've answered it three times, but
you've never responded to my answer and completely ignore it. I really
hope and expect that you'll participate more in the development process
and not continue to reiterate your talking points when you have no answer
to my response.

You're wrong, especially with regard to cpusets, which were formerly part
of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means
of isolation and attach a set of resources (memory, in this case) for
those tasks to use. The user who does this is fully aware of the set of
tasks being bound; there is no mystery and there are no unexpected results
when doing so. So when you set an oom_score_adj for a task, you don't
necessarily need to be aware of the set of resources it has available,
which is dynamic and an attribute of the system or cgroup, but rather the
priority of that task in competition with other tasks for the same
resources.

_That_ is what is important in having a userspace influence on a badness
heuristic: how those badness scores compare relative to other tasks that
share the same resources. That's how a task is chosen for oom kill, not
because of a static formula such as you're introducing here that outputs a
value (and, thus, a priority) regardless of the context in which the task
is bound.

That also means that the same task is not necessarily killed in a
cpuset-constrained oom compared to a system-wide oom. If you bias a task
by 30% of available memory, which Kame did in his example above, it's
entirely plausible that task A should be killed even though its actual
usage is only 1/20th of the machine. When its cpuset is oom, and the admin
has specifically bound that task to only 2G of memory, we'd naturally want
to kill the memory hog, which is using 50% of the total memory available
to it.
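
To put numbers on Kame's example, here is a minimal sketch in Python. It
assumes the simplified form used in the quoted arithmetic (usage scaled to
a 0-1000 range against the memory available in the oom context, plus
oom_score_adj). The function name badness() and the truncating division
are illustrative assumptions, not the kernel's code.

```python
MB = 1
GB = 1024 * MB

def badness(usage, available, oom_score_adj=0):
    """Hypothetical score: share of available memory (0-1000) plus the bias."""
    return usage * 1000 // available + oom_score_adj

# Task A: 200M used, oom_score_adj = 300. Task B: 1G used, oom_score_adj = 0.
tasks = {"A": (200 * MB, 300), "B": (1 * GB, 0)}

for context, available in (("system-wide (4G)", 4 * GB), ("cpuset (2G)", 2 * GB)):
    scores = {name: badness(usage, available, adj)
              for name, (usage, adj) in tasks.items()}
    victim = max(scores, key=scores.get)
    print(context, scores, "-> kill", victim)
```

With truncating division the scores are 348 vs 250 system-wide and 397 vs
500 inside the 2G cpuset (Kame rounded these to 350 and 400), so A is
selected system-wide and B is selected in the cpuset.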

> 2) Ratio-based points don't work on large machines
> oom_score_adj normalizes the oom score to the 0-1000 range,
> but if the machine has 1TB memory, 1 point (i.e. 0.1%) means
> 1GB. This is not suitable for a tuning parameter.
> As I said, a proportional-value-oriented tuning parameter has
> a scalability risk.
>

So you'd rather use the range of oom_adj from -17 to +15 instead of
oom_score_adj from -1000 to +1000 where each point is 68GB? I don't
understand your point here as to why oom_score_adj is worse.

But, yes, in reality we don't really care about the granularity so much
that we need to prioritize a task using 512MB more memory than another to
break the tie on a 1TB machine, 1/2048th of its memory.
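
A quick sketch of the granularity being discussed, assuming one point is
1/1000th of the memory in question and using binary units (1TB = 2^40
bytes). The helper names are hypothetical.

```python
from fractions import Fraction

MB = 2**20
TB = 2**40

def bytes_per_point(total_bytes):
    """Memory represented by one point on a 0-1000 scale."""
    return total_bytes // 1000

def points_for(quantity_bytes, total_bytes):
    """How many points a given memory quantity is worth."""
    return quantity_bytes * 1000 / total_bytes

print(bytes_per_point(TB))       # bytes per point on a 1TB machine (about 1GB)
print(Fraction(512 * MB, TB))    # 512MB as a fraction of 1TB
print(points_for(512 * MB, TB))  # 512MB expressed in points
```

On these assumptions 512MB is 1/2048th of 1TB, which is a little under
half a point.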

> 3) No reason to implement ABI breakage.
> old tuning parameter mean)
> oom-score = oom-base-score x 2^oom_adj

Everybody knows this is useless beyond polarizing a task for kill or
making it immune.

> new tuning parameter mean)
> oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)

This, on the other hand, has an actual unit (proportion of available
memory) that can be used to prioritize tasks amongst those competing for
the same set of shared resources and remains constant even when a task
changes cpuset, its memcg limit changes, etc.

And your equation is wrong, it's

((rss + swap) / (available ram + swap)) + oom_score_adj

which is completely different from what you think it is.

> but "oom_score_adj / (totalram + totalswap)" can be calculated in
> userland too, because both totalram and totalswap have been exposed by
> /proc. So no reason to introduce funny new equation.
>

Yup, it definitely can, which is why, as I mentioned to Kame (who doesn't
have strong feelings either way, even though you quote him as having these
strong objections), you can easily convert oom_score_adj into a
stand-alone memory quantity (biasing or forgiving 512MB of a task's
memory, for example) in the context it is currently attached to with
simple arithmetic in userspace. That's why oom_score_adj is powerful.
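
As a rough illustration of that userspace arithmetic, the sketch below
converts a memory quantity into an oom_score_adj value for a given
context. It assumes the adjustment is expressed in thousandths of the
memory available in that context and that the system-wide total is taken
from the MemTotal and SwapTotal fields of /proc/meminfo (reported in kB).
The function names are hypothetical, and for a cpuset or memcg you would
substitute that context's limit for the system-wide total.

```python
def total_ram_plus_swap_bytes(meminfo_text):
    """Parse MemTotal + SwapTotal (both in kB) from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return (fields["MemTotal"] + fields["SwapTotal"]) * 1024

def score_adj_for_bytes(bias_bytes, available_bytes):
    """Convert a memory bias (negative to forgive) into an oom_score_adj."""
    adj = round(bias_bytes * 1000 / available_bytes)
    return max(-1000, min(1000, adj))  # oom_score_adj ranges from -1000 to +1000

# Example input: 3G of RAM plus 1G of swap, in /proc/meminfo format.
sample = "MemTotal:        3145728 kB\nSwapTotal:       1048576 kB\n"
available = total_ram_plus_swap_bytes(sample)
print(score_adj_for_bytes(512 * 2**20, available))   # bias by 512MB
print(score_adj_for_bytes(-512 * 2**20, available))  # forgive 512MB
```

On a live system the text would come from reading /proc/meminfo; the
same 512MB is worth twice as many points in a context with half the
memory.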

> 4) totalram based normalization assumes a flat memory model.
> For example, if the machine is asymmetric NUMA, fat node memory and thin
> node memory might have different weight values.
> In other words, totalram based priority is one policy. A fixed,
> workload-dependent policy shouldn't be embedded in the kernel. Probably.
>

I don't know what this means, and this was your criticism before I changed
the denominator during the revision of the patchset, so it's probably
obsolete. oom_score_adj always operates based on the proportion of
memory available to the application, which is how the new oom killer
determines which tasks to kill: relative to the importance (if defined by
userspace) and memory usage compared to other tasks competing for it.

> Then, this patch remove *UGLY* total_pages suck completely. Googler
> can calculate it at userland!
>

Nothing about any of this is specific to Google. Users who actually set up
their machines to use mempolicies, cpusets, or memcgs actually do want a
powerful interface from userspace to tune the priorities in terms of both
business goals and also importance of the task. That is done much more
powerfully now with oom_score_adj than the previous implementation. Users
who don't use these cgroups, especially desktop users, can see
oom_score_adj in terms of a memory quantity that remains static: they
aren't going to encounter changing memcg limits, cpuset mems, etc.

That said, I really don't know why you keep mentioning "Google this" and
"Google that" when the company I'm working for is really irrelevant to
this discussion.

With that, I respectfully nack your patch.