Re: [PATCH] Prevent OOM casualties by enforcing memcg limits

From: Alexander Sosna
Date: Tue Apr 27 2021 - 07:01:46 EST

On 27.04.21 10:08, Michal Hocko wrote:
> On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
>> Hi Chris,
>> Am 27.04.21 um 02:09 schrieb Chris Down:
>>> Hi Alexander,
>>> Alexander Sosna writes:
>>>> Before this commit memory cgroup limits were not enforced during
>>>> allocation.  If a process within a cgroup tries to allocate more
>>>> memory than allowed, the kernel will not prevent the allocation even if
>>>> OVERCOMMIT_NEVER is set.  Then the OOM killer is activated to kill
>>>> processes in the corresponding cgroup.
>>> Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
>>> since exceeding memory.max is not overcommitment, it's just a natural
>>> consequence of the fact that allocation and reclaim are not atomic
>>> processes. Overcommitment, on the other hand, is about the bounds of
>>> available memory at the global resource level.
>>>> This behavior is not to be expected
>>>> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>>>> problem for applications assuming that the kernel will deny an allocation
>>>> if not enough memory is available, like PostgreSQL.  To prevent this a
>>>> check is implemented to not allow a process to allocate more memory than
>>>> limited by its cgroup.  This means a process will not be killed while
>>>> accessing pages but will receive errors on memory allocation as
>>>> appropriate.  This gives programs a chance to handle memory allocation
>>>> failures gracefully instead of being reaped.
>>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>>> isn't relying on that.
>>> Could you please be more clear about the "huge problem" being solved
>>> here? I'm not seeing it.
>> Let me explain the problem I encounter and why I fell down the mm rabbit
>> hole. It is not a PostgreSQL-specific problem but that's where I ran
>> into it. PostgreSQL forks a backend for each client connection. All
>> backends have shared memory as well as local work memory. When a
>> backend needs more dynamic work_mem to execute a query, new memory
>> is allocated. It is normal that such an allocation can fail. If the
>> backend gets an ENOMEM the current query is rolled back and all dynamic
>> work_mem is freed. The RDBMS stays operational and no other query is
>> disturbed.
> I am afraid the kernel MM implementation has never been really
> compatible with such a memory allocation model. Linux has always
> preferred to pretend there is always memory available and to reclaim
> memory - including by killing some processes - rather than fail the
> allocation with ENOMEM. Overcommit configuration (especially
> OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
> memory allocation approach but in reality this has turned out to be a)
> unreliable and b) unusable with modern userspace which relies on
> considerable virtual memory overcommit.

Thank you for taking the time to discuss this issue with me. I agree
that the kernel and a lot of software prefer to pretend there is more
memory than there really is. It was also never possible to assume that
the OOM killer is fully absent. I have been running production Linux
systems for quite a while now, and without memory cgroups involved
OVERCOMMIT_NEVER does a pretty good job. I can't even remember the last
time the OOM killer caused me any problems on a properly configured
database server. This is what I would like, and what users should be
able to expect, when using cgroup memory limits as well.

Please correct me if I am wrong, but "modern userspace which relies on
considerable virtual memory overcommit" should not rely on the kernel to
overcommit memory when OVERCOMMIT_NEVER is explicitly set.
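
To make "explicitly set" concrete, here is a minimal sketch of how a
service could check the active policy at startup. The symbolic names
follow the kernel's OVERCOMMIT_* constants; the function names are
hypothetical and only for illustration.

```python
from pathlib import Path

# Symbolic names for the documented values of vm.overcommit_memory.
OVERCOMMIT_MODES = {
    0: "OVERCOMMIT_GUESS",
    1: "OVERCOMMIT_ALWAYS",
    2: "OVERCOMMIT_NEVER",
}


def parse_overcommit_mode(text):
    """Map the content of the sysctl file to its symbolic name."""
    return OVERCOMMIT_MODES.get(int(text.strip()), "unknown")


def current_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the live setting; only meaningful on a Linux system."""
    return parse_overcommit_mode(Path(path).read_text())
```

A database wrapper could call current_overcommit_mode() once and warn
if the result is not OVERCOMMIT_NEVER.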

>> When running in a memory cgroup - for example via systemd or on k8s -
>> the kernel will not return ENOMEM even if the cgroup's memory limit is
>> exceeded.
> Yes, memcg doesn't change the overall approach. It just restricts the
> existing semantic with a smaller memory limit. Also overcommit heuristic
> has never been implemented for memory controllers.
>> Instead the OOM killer is awakened and kills processes in the
>> violating cgroup. If any backend is killed with SIGKILL the shared
>> memory of the whole cluster is deemed potentially corrupted and
>> PostgreSQL needs to do an emergency restart. This cancels all operation
>> on all backends and it entails a potentially lengthy recovery process.
>> Therefore the behavior is quite "costly".
> One way around that would be to use high limit rather than hard limit
> and pro-actively watch for memory utilization and communicate that back
> to the application to throttle its workers. I can see how that
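
For reference, a rough sketch of what such a watcher could look like,
assuming cgroup v2 and its memory.current and memory.high interface
files. The 90% threshold and all function names are my own assumptions,
and how the result is communicated back to the application is left open.

```python
from pathlib import Path


def parse_limit(text):
    """Parse a cgroup v2 limit file; the literal "max" means no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def should_throttle(current_bytes, high_bytes, threshold=0.9):
    """Return True once usage reaches the given fraction of memory.high."""
    if high_bytes is None:
        # No high limit configured, so there is nothing to compare against.
        return False
    return current_bytes >= threshold * high_bytes


def read_cgroup_usage(cgroup_dir):
    """Read (memory.current, memory.high) from a cgroup v2 directory."""
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text())
    high = parse_limit((base / "memory.high").read_text())
    return current, high
```

A supervisor would poll read_cgroup_usage() for the service's cgroup
directory and stop handing out new work while should_throttle() is true.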
>> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
> I do not see how it can ever promise anything like that. Memory
> consumption by kernel subsystems cannot be predicted at the time virtual
> memory is allocated from the userspace. Not only can it not be predicted but
> it is also highly impractical to force kernel allocations - necessary
> for the OS operation - to fail just because userspace has reserved
> virtual memory. So this all is just a heuristic to help in some
> extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
> say the least.

I cannot fully follow why we would need to let kernel allocations
fail here. Yes, if you run a system to the point where the kernel can't
free enough memory, invasive decisions have to be made. But think of an
application server running multiple applications in memcgs, each with
its limit way below the available resources. Why is it preferable to
SIGKILL a process rather than just deny the limit-exceeding malloc,
provided OVERCOMMIT_NEVER is set of course?
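
The difference for the application can be sketched as follows. This is
only an illustration, not PostgreSQL code: Python's MemoryError stands
in for a malloc() that fails with ENOMEM, and run_query and the injected
allocator are hypothetical names. A SIGKILL cannot be caught, so there
is no equivalent of the except branch in that case.

```python
def run_query(allocate_work_mem, nbytes):
    """Run one query; roll back instead of dying if memory is denied."""
    try:
        work_mem = allocate_work_mem(nbytes)
    except MemoryError:
        # The allocation was refused: abort only this query and keep
        # the process (and everything it shares with others) alive.
        return "rolled back"
    result = f"done with {len(work_mem)} bytes"
    del work_mem  # release the work memory once the query is finished
    return result


def denied(nbytes):
    """Stand-in allocator simulating a denied allocation."""
    raise MemoryError(f"cannot allocate {nbytes} bytes")
```

With a denied allocation only the current query fails; all other
backends continue undisturbed.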

>> would highly appreciate if the kernel would use a less invasive means
>> whenever possible. I guess this might also be the expectation by many
>> other users. In my described case - which is a real pain for me - it is
>> quite easy to tweak the kernel behavior in order to handle this and
>> other similar situations with fewer casualties. This is why I sent a
>> patch instead of starting a theoretical discussion.
> I am pretty sure that many users would agree with you on that but the
> matter of fact is that a different approach has been chosen
> historically. We can argue whether this has been a good or bad design
> decision but I do not see that to change without a lot of fallouts. Btw.
> a strong memory reservation approach can be found with hugetlb pages and
> this one has turned out to be very tricky both from implementation and
> userspace usage POV. Needless to say that it operates on a single
> purpose preallocated memory pool and it would be quite reasonable to
> expect the complexity would grow with more users of the pool which is
> the general case for general purpose memory allocator.

The history is very interesting and needs to be taken into
consideration. What drives me is to help myself and all other Linux
users run workloads like RDBMSs reliably, even in modern environments
like k8s which make use of memory cgroups. I see a gain for the
community in developing a reliable and easily available solution, even
if my current approach might be amateurish and is not the right answer.
Could you elaborate on where you see "a lot of fallouts"?
overcommit_memory 2 is only set when needed for the desired workload.

If the gain is worth it, one could implement an overcommit_memory 3 to
enable this behavior. overcommit_memory needs to be explicitly set by
the sysadmin anyway.

>> What do you think is necessary to get this to an approvable quality?
> See my other reply.