Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection

From: 程垲涛 Chengkaitao Cheng
Date: Thu May 25 2023 - 04:20:20 EST

At 2023-05-24 06:02:55, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
>On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
><chengkaitao@xxxxxxxxxxxxxx> wrote:
>> At 2023-05-20 06:04:26, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
>> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
>> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
>> >>
>> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
>> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
>> >> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
>> >> >>
>> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
>> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
>> >> >> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
>> >> >> >>
>> >> >>
>> >> >> Killing processes in order of memory usage cannot effectively protect
>> >> >> important processes. Killing processes in a user-defined priority order
>> >> >> can result in a large number of OOM events while still failing to
>> >> >> release enough memory. I have been searching for a balance between
>> >> >> the two methods, so that the shortcomings of each are less pronounced.
>> >> >> The biggest advantage of memcg is its tree topology, and I also hope
>> >> >> to make good use of it.
>> >> >
>> >> >For us, killing processes in a user-defined priority order works well.
>> >> >
>> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
>> >> >observe how many times this memcg has been killed due to a limit in an
>> >> >ancestor. Wouldn't it be more straightforward to specify the priority
>> >> >of protections among memcgs?
>> >> >
>> >> >For example, if you observe multiple memcgs being OOM killed due to
>> >> >hitting an ancestor limit, you will need to decide which of them to
>> >> >increase memory.oom.protect for more, based on their importance.
>> >> >Otherwise, if you increase all of them, then there is no point if all
>> >> >the memory is protected, right?
>> >>
>> >> If all memory in a memcg is protected, the effect is similar to that of the
>> >> highest-priority memcg in your approach, which is either killed last or
>> >> never killed.
>> >
>> >Makes sense. I believe it gets a bit trickier when you want to
>> >describe relative ordering between memcgs using memory.oom.protect.
>> Actually, my original intention was not to use memory.oom.protect to
>> achieve relative ordering between memcgs; that was just a feature that
>> happened to be achievable. My initial idea was to protect a certain
>> proportion of a memcg's memory from being OOM-killed, so that
>> physical memory can be planned sensibly. Both the physical
>> machine manager and the container manager can add some unimportant
>> loads beyond the oom.protect limit, greatly improving the memory
>> overcommit ratio. In the worst case, the physical machine can
>> always provide all the memory covered by memory.oom.protect to the memcg.
>> On the other hand, I also want to achieve relative ordering of the
>> processes inside a memcg, not just a unified ordering of all memcgs on
>> the physical machine.
>For us, having a strict priority ordering-based selection is
>essential. We have different tiers of jobs of different importance,
>and a job of higher priority should not be killed before a lower
>priority task if possible, no matter how much memory either of them is
>using. Protecting memcgs solely based on their usage can be useful in
>some scenarios, but not in a system where you have different tiers of
>jobs running with strict priority ordering.

Strict priority ordering can also be achieved with memory.oom.protect,
although it is somewhat cumbersome to set up. The cgroup hierarchy shown
below achieves it.

            /        \
     cgroup A        cgroup B
   (protect=max)    (protect=0)
                     /        \
              cgroup C        cgroup D
            (protect=max)    (protect=0)
                              /        \
                       cgroup E        cgroup F
                     (protect=max)    (protect=0)

OOM kill order: F > E > C > A

As mentioned earlier, "running with strict priority ordering" may be
an extreme case, and it requires the manager to make a choice.
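
To make the kill order above easier to check, here is a toy model of the
hierarchy. It is only an illustration, not the selection code from the
patch: the Cgroup class, the kill_order() helper and the "root" node name
are all made up for this sketch, and protect=max is represented by
infinity. The only rule it assumes is that, among siblings, the
less-protected subtree is exhausted before the more-protected one.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX = float("inf")  # stands in for protect=max (assumption of this sketch)


@dataclass
class Cgroup:
    """Hypothetical stand-in for a memcg; not a kernel structure."""
    name: str
    protect: float = 0
    children: list[Cgroup] = field(default_factory=list)


def kill_order(group: Cgroup) -> list[str]:
    """Return leaf cgroup names in the order this toy model selects them."""
    if not group.children:
        return [group.name]
    order: list[str] = []
    # Visit the least-protected sibling first, so its whole subtree is
    # listed before any sibling with a larger protect value.
    for child in sorted(group.children, key=lambda c: c.protect):
        order.extend(kill_order(child))
    return order


# Same shape as the diagram: the protect=0 branch is the one that nests.
f = Cgroup("F", 0)
e = Cgroup("E", MAX)
d = Cgroup("D", 0, [e, f])
c = Cgroup("C", MAX)
b = Cgroup("B", 0, [c, d])
a = Cgroup("A", MAX)
root = Cgroup("root", 0, [a, b])

print(" > ".join(kill_order(root)))
```

Each additional priority level needs one more level of nesting under the
protect=0 branch, which is why this layout becomes troublesome as the
number of tiers grows.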

>> >> >In this case, wouldn't it be easier to just tell the OOM killer the
>> >> >relative priority among the memcgs?
>> >> >
>> >> >>
>> >> >> >If this approach works for you (or any other audience), that's great,
>> >> >> >I can share more details and perhaps we can reach something that we
>> >> >> >can both use :)
>> >> >>
>> >> >> If you have a good idea, please share more details or show some code.
>> >> >> I would greatly appreciate it
>> >> >
>> >> >The code we have needs to be rebased onto a different version and
>> >> >cleaned up before it can be shared, but essentially it is as
>> >> >described.
>> >> >
>> >> >(a) All processes and memcgs start with a default score.
>> >> >(b) Userspace can specify scores for memcgs and processes. A higher
>> >> >score means higher priority (aka less score gets killed first).
>> >> >(c) The OOM killer essentially looks for the memcg with the lowest
>> >> >scores to kill, then among this memcg, it looks for the process with
>> >> >the lowest score. Ties are broken based on usage, so essentially if
>> >> >all processes/memcgs have the default score, we fallback to the
>> >> >current OOM behavior.
>> >>
>> >> If memory is severely overcommitted, all processes in the lowest-priority
>> >> memcg may be killed before processes in any other memcg are selected.
>> >> If there are 1000 processes with almost zero memory usage in
>> >> the lowest-priority memcg, 1000 ineffective kill events may occur.
>> >> To avoid this situation, even for the lowest-priority memcg,
>> >> I would leave it a very small oom.protect quota.
>> >
>> >I checked internally, and this is indeed something that we see from
>> >time to time. We try to avoid that with userspace OOM killing, but
>> >it's not 100% effective.
>> >
>> >>
>> >> If faced with two memcgs with the same total memory usage and
>> >> priority, memcg A has more processes but less memory usage per
>> >> single process, and memcg B has fewer processes but more
>> >> memory usage per single process, then when OOM occurs, the
>> >> processes in memcg B may continue to be killed until all processes
>> >> in memcg B are killed, which is unfair to memcg B because memcg A
>> >> also occupies a large amount of memory.
>> >
>> >I believe in this case we will kill one process in memcg B, then the
>> >usage of memcg A will become higher, so we will pick a process from
>> >memcg A next.
>> Suppose there is only one process in memcg A and its memory usage is
>> higher than that of any process in memcg B, but the total memory usage
>> of memcg A is lower than that of memcg B. If the OOM killer still
>> chooses the process in memcg A in this case, it may be unfair to memcg A.
>> >> Does your approach have these issues? Killing processes in a
>> >> user-defined priority is indeed easier and can work well in most cases,
>> >> but I have been trying to solve the cases that it cannot cover.
>> >
>> >The first issue is relatable with our approach. Let me dig more info
>> >from our internal teams and get back to you with more details.

Thanks for your comment!