Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

From: Michal Hocko
Date: Mon May 22 2023 - 09:03:58 EST


[Sorry for the late reply, but I was mostly offline for the last 2 weeks]

On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
[...]
> >Your cover letter mentions "all processes in the cgroup as a
> >whole". That to me reads as the oom.group oom killer policy. But a brief
> >look into the patch suggests you are still looking at specific tasks, and
> >this was a concern in the previous version of the patch because
> >memcg accounting and per-process accounting are detached.
>
> I think the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive, similar to active page cache, which
> also increases the probability of OOM-kill. In the new patch, all the
> shared memory will also consume the oom_protect quota of the memcg,
> and the process's oom_protect quota of the memcg will decrease.

I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?

[...]

> >> In the final discussion of patch v2, we discussed that although the adjustment range
> >> of oom_score_adj is [-1000,1000], it essentially only supports two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> >> clumsy at best. To solve this problem in the new patch, I introduced a new
> >> indicator oom_kill_inherit, which counts the number of times the local and child
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
>
> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> with memory.oom.protect, so we have a ruler to measure the optimal value of
> memory.oom.protect.

An example might help here.

> >> About the semantics of non-leaf memcg protection:
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcgs will calculate
> >> their new effective oom_protect quota in proportion to the non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with the oom.group.
>
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group will
> be proportionally reduced.
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> of the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
> will be proportionally increased.
> The purpose of doing so is that users can set oom_protect quota according to
> their own needs, and the system management process can set appropriate
> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> the system management process can indirectly manage all user processes.
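
The scaling rule described in the quoted text above can be illustrated with a
minimal userspace sketch. It assumes plain proportional scaling by
parent quota / sum of sub-group quotas. The function name, the integer
rounding, and the handling of an all-zero sum are illustrative assumptions
and are not taken from the patch:

```python
def effective_oom_protect(parent_quota, child_quotas):
    """Scale each sub-group quota by parent_quota / sum(child_quotas).

    Hypothetical helper: models only the proportional rule described in
    the thread, not the patch's actual implementation.
    """
    total = sum(child_quotas.values())
    if total == 0:
        # Assumption: with no sub-group quota set, nothing is distributed.
        return {name: 0 for name in child_quotas}
    # Integer arithmetic, as kernel code would typically use.
    return {
        name: quota * parent_quota // total
        for name, quota in child_quotas.items()
    }


# Parent quota below the sum of sub-group quotas: each share is reduced.
print(effective_oom_protect(100, {"a": 150, "b": 50}))
# Parent quota above the sum of sub-group quotas: each share is increased.
print(effective_oom_protect(400, {"a": 150, "b": 50}))
```

In both cases the sub-groups keep their 3:1 ratio and only the absolute
amounts change with the parent's quota.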

I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine, well, in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly but it might be a suitable target to kill a multi
memcg based workload (e.g. a full container).
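
For reference, the oom.group lookup mentioned earlier can be sketched roughly
as follows. This is a simplified userspace model, not kernel code: the Memcg
class and find_oom_group() are made-up names, and it assumes the walk goes
from the victim's memcg up to the OOM domain and remembers the topmost memcg
that has oom.group set:

```python
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, for illustration)."""

    def __init__(self, name, parent=None, oom_group=False):
        self.name = name
        self.parent = parent
        self.oom_group = oom_group


def find_oom_group(victim_memcg, oom_domain):
    """Return the topmost memcg with oom_group set between the victim's
    memcg and the OOM domain (inclusive), or None if there is none."""
    group = None
    memcg = victim_memcg
    while memcg is not None:
        if memcg.oom_group:
            group = memcg  # keep going: a higher ancestor may override this
        if memcg is oom_domain:
            break  # never look above the OOM domain
        memcg = memcg.parent
    return group


root = Memcg("root")
container = Memcg("container", parent=root, oom_group=True)
worker = Memcg("worker", parent=container)

selected = find_oom_group(worker, root)
print(selected.name if selected else None)
```

Here the victim task lives in "worker", which has no tasks of the container's
other sub-groups, yet the whole "container" subtree is what gets killed.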

> >All that being said, and with the usecase described more specifically, I
> >can see that memcg based oom victim selection makes some sense. That
> >means that it is always a memcg selected and all tasks within killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously. Essentially
> >we would need to enforce the oom selection to be memcg based for all
> >memcgs. Maybe a mount knob? What do you think?
>
> There is a function in the patch to determine whether the oom_protect
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> <is_root_oom_protect> returns 0 by default.

How can an admin determine what the current oom detection logic is?

--
Michal Hocko
SUSE Labs