Re: [RFC 0/4] memcg: Low-limit reclaim

From: Michal Hocko
Date: Thu Jan 30 2014 - 07:30:54 EST


On Wed 29-01-14 11:08:46, Greg Thelen wrote:
[...]
> The series looks useful. We (Google) have been using something similar.
> In practice such a low_limit (or memory guarantee), doesn't nest very
> well.
>
> Example:
> - parent_memcg: limit 500, low_limit 500, usage 500
>   1 privately charged non-reclaimable page (e.g. mlock, slab)
> - child_memcg: limit 500, low_limit 500, usage 499

I am not sure this is a good example. Your setup basically says that not
a single page should be reclaimed. I can imagine this might be useful in
some cases and I would like to allow it, but it sounds too extreme (e.g.
a load which would start thrashing heavily once the reclaim starts, so
it makes more sense to restart it rather than crawl - think about some
mathematical simulation which might diverge).

> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> page cache it will lead to an oom kill instead of reclaiming.

Does it make any sense to protect all of such memory although it is
easily reclaimable?
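
For reference, the scenario above can be modeled roughly as follows. This
is only a sketch: the Memcg class and is_protected() are hypothetical
names, and the rule that a group at or below its low_limit is skipped by
reclaim is my assumption about the semantics, not code from the series.

```python
from dataclasses import dataclass


@dataclass
class Memcg:
    name: str
    limit: int      # hard limit, in pages
    low_limit: int  # reclaim protection, in pages
    usage: int      # hierarchical usage, in pages


def is_protected(cg: Memcg) -> bool:
    # Assumption: reclaim skips a group whose usage does not exceed
    # its low_limit.
    return cg.usage <= cg.low_limit


# Numbers taken from the quoted example: the parent holds one private
# page on top of the 499 pages charged to the child.
parent = Memcg("parent_memcg", limit=500, low_limit=500, usage=500)
child = Memcg("child_memcg", limit=500, low_limit=500, usage=499)

reclaimable = [cg.name for cg in (parent, child) if not is_protected(cg)]
at_limit = parent.usage >= parent.limit

# With the hierarchy at its hard limit and nothing eligible for
# reclaim, the only way to make progress in this model is an OOM kill.
outcome = "oom" if at_limit and not reclaimable else "reclaim"
print(reclaimable, outcome)
```

In this model every group sits at or below its low_limit, so any new
page cache charge has nowhere to come from.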

> One could
> argue that this is working as intended because child_memcg was promised
> 500 but can only get 499. So child_memcg is oom killed rather than
> being forced to operate below its promised low limit.
>
> This has led to various internal workarounds like:
> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
> only charge memory to cgroup leafs. This gets tricky when dealing
> with reparented memory inherited to parent from child during cgroup
> deletion.

Do those need any protection at all?

> - don't set low_limit on non leafs (e.g. do not set low limit on
> parent_memcg). This constrains the cgroup layout a bit. Some
> customers want to purchase $MEM and setup their workload with a few
> child cgroups. A system daemon hands out $MEM by setting low_limit
> for top-level containers (e.g. parent_memcg). Thereafter such
> customers are able to partition their workload with sub memcg below
> child_memcg. Example:
>   parent_memcg
>       \
>        child_memcg
>          /     \
>      server   backup

I think that the low_limit makes sense where you actually want to
protect something from reclaim. And backup sounds like a bad fit for
that.

> Thereafter customers often want some weak isolation between server and
> backup. To avoid undesired oom kills the server/backup isolation is
> provided with a softer memory guarantee (e.g. soft_limit). The soft
> limit acts like the low_limit until priority becomes desperate.

Johannes was already suggesting that the low_limit should allow for a
weaker semantic as well. I am not very much inclined to that but I can
live with a knob which would say oom_on_lowlimit (on by default but
allowed to be set to 0). With it set to 0, we would fall back to the
full reclaim if no groups turn out to be reclaimable.
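
A rough sketch of how such a knob could behave, just to pin down the
semantics. Only the name oom_on_lowlimit comes from the proposal above;
reclaim_decision() and the tuple layout are made up for illustration, and
the "usage above low_limit" eligibility test is an assumption.

```python
def reclaim_decision(groups, oom_on_lowlimit=True):
    """Decide what to do under memory pressure.

    groups: list of (name, usage, low_limit) tuples, sizes in pages.
    Returns a tuple (action, names of the groups to reclaim from).
    """
    # First pass: honor the low limit and skip protected groups.
    eligible = [name for name, usage, low in groups if usage > low]
    if eligible:
        return ("reclaim", eligible)

    # Nothing is reclaimable without breaking a guarantee.
    if oom_on_lowlimit:
        return ("oom", [])

    # Knob set to 0: ignore the low limits and reclaim from everybody.
    return ("reclaim", [name for name, _, _ in groups])


groups = [("server", 300, 300), ("backup", 199, 200)]
print(reclaim_decision(groups))                         # strict guarantee
print(reclaim_decision(groups, oom_on_lowlimit=False))  # weaker guarantee
```

The default keeps the guarantee strict, while setting the knob to 0
gives the softer behavior where protection is dropped only as a last
resort.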
--
Michal Hocko
SUSE Labs