Re: [PATCH v2] mm: memcontrol: protect the memory in cgroup from being oom killed

From: 程垲涛 Chengkaitao Cheng
Date: Fri Dec 09 2022 - 00:07:32 EST


At 2022-12-08 22:23:56, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>On Thu 08-12-22 14:07:06, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-08 16:14:10, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>> >On Thu 08-12-22 07:59:27, 程垲涛 Chengkaitao Cheng wrote:
>> >> At 2022-12-08 15:33:07, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>> >> >On Thu 08-12-22 11:46:44, chengkaitao wrote:
>> >> >> From: chengkaitao <pilgrimtao@xxxxxxxxx>
>> >> >>
>> >> >> We created a new interface <memory.oom.protect> for memory. If the OOM
>> >> >> killer is invoked under a parent memory cgroup and the memory usage of a
>> >> >> child cgroup is within its effective oom.protect boundary, the cgroup's
>> >> >> tasks won't be OOM killed unless there are no unprotected tasks in other
>> >> >> children cgroups. It draws on the logic of <memory.min/low> in the
>> >> >> inheritance relationship.
>> >> >>
>> >> >> It has the following advantages:
>> >> >> 1. We have the ability to protect more important processes when a
>> >> >> memcg's OOM killer is invoked. The oom.protect only takes effect in the
>> >> >> local memcg and does not affect the OOM killer of the host.
>> >> >> 2. Historically, we could often use oom_score_adj to control a group of
>> >> >> processes. That requires all processes in the cgroup to have a common
>> >> >> parent process, and we have to set the common parent process's
>> >> >> oom_score_adj before it forks all child processes, so it is very
>> >> >> difficult to apply in other situations. oom.protect has no such
>> >> >> restriction, so we can protect a cgroup of processes more easily. The
>> >> >> cgroup can keep some memory even if the OOM killer has to be called.
>> >> >>
>> >> >> Signed-off-by: chengkaitao <pilgrimtao@xxxxxxxxx>
>> >> >> ---
>> >> >> v2: Modify the formula of the process request memcg protection quota.
>> >> >
>> >> >The new formula doesn't really address concerns expressed previously.
>> >> >Please read my feedback carefully again and follow up with questions if
>> >> >something is not clear.
>> >>
>> >> The previous discussion was quite scattered. Can you help me summarize
>> >> your concerns again?
>> >
>> >The most important part is http://lkml.kernel.org/r/Y4jFnY7kMdB8ReSW@xxxxxxxxxxxxxx
>> >: Let me just emphasise that we are talking about fundamental disconnect.
>> >: Rss based accounting has been used for the OOM killer selection because
>> >: the memory gets unmapped and _potentially_ freed when the process goes
>> >: away. Memcg changes are bound to the object life time and as said in
>> >: many cases there is no direct relation with any process life time.
>> >
>> We need to discuss the relationship between memcg's mem and process's mem,
>>
>> task_usage = task_anon(rss_anon) + task_mapped_file(rss_file)
>> + task_mapped_share(rss_share) + task_pgtables + task_swapents
>>
>> memcg_usage = memcg_anon + memcg_file + memcg_pgtables + memcg_share
>> = all_task_anon + all_task_mapped_file + all_task_mapped_share
>> + all_task_pgtables + unmapped_file + unmapped_share
>> = all_task_usage + unmapped_file + unmapped_share - all_task_swapents
>
>You are missing all the kernel charged objects (aka __GFP_ACCOUNT
>allocations resp. SLAB_ACCOUNT for slab caches). Depending on the
>workload this can be really a very noticeable portion. So no, this is
>not just about unmapped cache or shm.
>
Kmem is indeed missing here, thanks for the reminder. But the patch still
applies once kmem is added.
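
As a rough sketch of the accounting above with a kernel-charged term added,
in Python. The name memcg_kmem is introduced here only for illustration (it
stands for kernel-charged objects other than page tables), and all values
are made-up page counts, not measurements:

```python
def task_usage(anon, mapped_file, mapped_share, pgtables, swapents):
    """Per-task usage as written in the formula above, in pages."""
    return anon + mapped_file + mapped_share + pgtables + swapents


def memcg_usage(tasks, unmapped_file, unmapped_share, memcg_kmem):
    """memcg usage derived from the per-task numbers.

    memcg_kmem is a hypothetical extra term for kernel-charged objects
    (__GFP_ACCOUNT / SLAB_ACCOUNT) that the original formula left out.
    """
    all_task_usage = sum(task_usage(**t) for t in tasks)
    all_task_swapents = sum(t["swapents"] for t in tasks)
    return (all_task_usage + unmapped_file + unmapped_share
            - all_task_swapents + memcg_kmem)
```

The larger memcg_kmem and the unmapped terms are, the further memcg_usage
drifts from the sum of the per-task numbers.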

>> >That is to the per-process discount based on rss or any per-process
>> >memory metrics.
>> >
>> >Another really important question is the actual configurability. The
>> >hierarchical protection has to be enforced and that means that same as
>> >memory reclaim protection it has to be enforced top-to-bottom in the
>> >cgroup hierarchy. That makes the oom protection rather non-trivial to
>> >configure without having a good picture of a larger part of the cgroup
>> >hierarchy as it cannot be tuned based on a reclaim feedback.
>>
>> There is an essential difference between reclaim and oom killer.
>
>oom killer is a memory reclaim of the last resort. So yes, there is some
>difference but fundamentally it is about releasing some memory. And long
>term we have learned that the more clever it tries to be the more likely
>corner cases can happen. It is simply impossible to know the best
>candidate so this is just a best effort. We try to aim for
>predictability at least.

Is the current oom_score strategy predictable? I don't think so. The score_adj
has already broken the predictability of oom_score (it is no longer simply
killing the process that uses the most memory). And I think that score_adj and
oom.protect are not there for the kernel to choose the best candidate, but for
the user to choose the candidate more conveniently. If the user configures
neither score_adj nor oom.protect, the kernel follows the simplest and most
direct logic (killing the process that uses the most memory).
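
To illustrate the point about score_adj, here is a minimal sketch in Python
of how the adjustment shifts the selection. It is a simplified model in the
spirit of oom_badness() in mm/oom_kill.c, not the kernel code itself, and the
task names and page counts are made up:

```python
OOM_SCORE_ADJ_MIN = -1000  # tasks with this value are never selected


def badness(rss, swapents, pgtables, oom_score_adj, totalpages):
    """Simplified per-task badness in pages; None means 'not eligible'."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss + swapents + pgtables
    # score_adj is scaled to the memory available to the OOM domain.
    points += oom_score_adj * (totalpages // 1000)
    return points


def pick_victim(tasks, totalpages):
    """Return the name of the eligible task with the highest badness."""
    best_name, best_points = None, None
    for name, (rss, swapents, pgtables, adj) in tasks.items():
        points = badness(rss, swapents, pgtables, adj, totalpages)
        if points is None:
            continue
        if best_points is None or points > best_points:
            best_name, best_points = name, points
    return best_name
```

With every score_adj at 0 the largest task is picked. As soon as one task
gets a negative score_adj, a smaller task can become the victim.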

>
>> The reclaim
>> cannot be directly perceived by users,
>
>I very strongly disagree with this statement. First the direct reclaim is a
>direct source of latencies because the work is done on behalf of the
>allocating process. There are side effects possible as well because
>refaults have their cost too.

The "direct perception" here does not mean that reclaim has no effect on the
performance of user processes. It emphasizes that users cannot make feedback
adjustments based on their own state alone and must rely on kernel
indicators.
>
>> so memcg need to count indicators
>> similar to pgscan_(kswapd/direct). However, when the user process is killed
>> by oom killer, users can clearly perceive and count (such as the number of
>> restarts of a certain type of process). At the same time, the kernel also has
>> memory.events to count some information about the oom killer, which can
>> also be used for feedback adjustment.
>
>Yes we have those metrics already. I suspect I haven't made myself
>clear. I didn't say there are no measures to see how oom behaves. What
>I've said is that I _suspect_ that oom protection would be really hard to
>configure correctly because unlike the memory reclaim which happens
>during the normal operation, oom is a relatively rare event and it is
>quite hard to use it for any feedback mechanisms.

We have discussed a similar case:
https://lore.kernel.org/linux-mm/EF1DC035-442F-4BAE-B86F-6C6B10B4A094@xxxxxxxxxxxxxx/
* More and more users want to save costs by setting memory.max to a very
* small value, which results in a small number of oom events. Users can
* tolerate these events, but they want to minimize their impact. In such
* scenarios, oom events are no longer abnormal and unpredictable. We need
* to provide convenient oom policies for users to choose from.
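
For the feedback side, a minimal sketch in Python of how a user could count
oom kills from memory.events between two samples. The helper names are mine
and only for illustration; the input is the "key value" text read from a
cgroup's memory.events file:

```python
def parse_memory_events(text):
    """Parse 'key value' lines of memory.events into a dict of ints."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def oom_kills_since(previous, current):
    """Number of oom_kill events between two parsed samples."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)
```

A monitoring job could sample the file periodically and adjust the
protection when the delta for an important cgroup is non-zero.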

> But I am really open
>to be convinced otherwise and this is in fact what I have been asking
>for since the beginning. I would love to see some examples on the
>reasonable configuration for a practical usecase.

Here is a simple example. In a docker container, users can divide all processes
into two categories (important and normal) and put them in different cgroups.
One cgroup's oom.protect is set to "max" and the other's to "0". In this way,
important processes in the container are protected.
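
A minimal sketch of that configuration in Python. It assumes the
memory.oom.protect file proposed by this patch exists in each child cgroup;
the child names "important" and "normal" and the container cgroup path are
hypothetical:

```python
import os


def set_oom_protect(container_cgroup, important="important", normal="normal"):
    """Write the proposed memory.oom.protect values for two child cgroups.

    container_cgroup is the container's cgroup directory (hypothetical
    path); the important child gets "max" and the normal child gets "0".
    """
    settings = {important: "max", normal: "0"}
    for child, value in settings.items():
        path = os.path.join(container_cgroup, child, "memory.oom.protect")
        with open(path, "w") as f:
            f.write(value)
    return settings
```

No per-process setting is needed; moving a process into the important
cgroup is enough for it to be covered.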

> It is one thing to say
>that you can set the protection to a certain value and a different one
>to have a way to determine that value. See my point?

As things stand, if score_adj is set, the only way for the kernel to
determine the value is "cat /proc/pid/oom_score". In the oom.protect scheme,
I also propose changing "/proc/pid/oom_score".
Please refer to this link:
https://lore.kernel.org/linux-mm/C2CC36C1-29AE-4B65-A18A-19A745652182@xxxxxxxxxxxxxx/

>
>--
>Michal Hocko
>SUSE Labs