Re: [PATCH 03/13] mm: shmem: provide oom badness for shmem files

From: Christian König
Date: Wed Jun 15 2022 - 08:35:40 EST


On 13.06.22 at 16:11, Michal Hocko wrote:
[SNIP]
Let me maybe get back to the initial question: We have resources which are
not related to the virtual address space of a process, how should we tell
the OOM killer about them?
I would say memcg, but we have discussed this already...

Well, memcg is at least closer to the requirements than the classic mm_struct accounting.

It won't work for really shared buffers, but if that's what it takes to find a doable solution for the remaining 99%, then I can live with that.

I do not think that exposing a resource (in the form of a counter
or something like that) is sufficient. The existing oom killer
implementation is heavily process centric (with a memcg extension for
grouping but not changing the overall design in principle). If you
want to make it aware of resources which are not directly accounted to
processes then a new implementation is necessary IMHO. You would need
to evaluate those resources and kill all the tasks that can hold on to
that resource.

Well, the OOM killer is process centric because processes are what you can kill.

Even the classic mm_struct based accounting includes MM_SHMEMPAGES in the badness. So accounting shared resources as badness to make a decision is nothing new here.
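
To illustrate the point about MM_SHMEMPAGES, here is a minimal user-space sketch of how the mm_struct based badness is put together. It only models the arithmetic as I understand it from mm/oom_kill.c; struct mm_model, model_rss() and model_badness() are hypothetical names for illustration, not kernel interfaces, and details such as unkillable tasks are left out.

```c
/* Hypothetical user-space stand-in for the per-mm counters; not a kernel type. */
struct mm_model {
	long file_pages;    /* models MM_FILEPAGES */
	long anon_pages;    /* models MM_ANONPAGES */
	long shmem_pages;   /* models MM_SHMEMPAGES */
	long swap_entries;  /* models MM_SWAPENTS */
	long pgtable_pages; /* page-table memory, expressed in pages */
};

/* RSS as the sum of file, anonymous and shmem pages. */
long model_rss(const struct mm_model *mm)
{
	return mm->file_pages + mm->anon_pages + mm->shmem_pages;
}

/*
 * Badness in pages: RSS plus swap entries plus page tables, shifted by
 * oom_score_adj scaled to a thousandth of the total pages.
 */
long model_badness(const struct mm_model *mm, long oom_score_adj,
		   long totalpages)
{
	long points = model_rss(mm) + mm->swap_entries + mm->pgtable_pages;

	points += oom_score_adj * (totalpages / 1000);
	return points;
}
```

In this model every shmem page mapped by a task raises its badness by one, no matter how many other tasks map the same page, which is the sense in which shared resources already count today.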

The difference is that this time the badness doesn't come from the memory management subsystem, but rather from the I/O subsystem.

This is also the reason why I am not really a fan of the per file
badness because it adds a notion of resource that is not process bound
in general so it will add all sorts of weird runtime corner cases which
are impossible to anticipate [*]. Maybe that will work in some scenarios
but it is definitely not something to be done by default without users
opting into that and being aware of the consequences.

Would a kernel command line option to control the behavior be helpful here?

There have been discussions that the existing oom implementation cannot
fit all potential use cases so maybe we need to finally decide to use a
pluggable, BPF-able etc. architecture to allow implementations that fit
specific needs.

Yeah, BPF came to my mind as well. But I need to talk with our experts on that topic first.

When the OOM killer runs, allocating more memory is pretty much a no-go, and I'm not sure what the requirements for running a BPF program to find the badness are.

[*] I know it is not directly related but kinda similar. In the past
we used to have heuristics to consider work done as a resource. That is,
preferably kill younger processes to reduce the damage. This turned
out to have very unpredictable behavior and led to many complaints from
users. The situation improved when the selection was based solely on
rss. This has its own cons of course but at least they are predictable.

Good to know, thanks.

Regards,
Christian.