Re: Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY

From: Abel Wu
Date: Tue Aug 01 2023 - 02:53:38 EST


On 7/28/23 12:30 PM, Roman Gushchin wrote:
On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
This patchset tries to add a new bpf prog type and use it to select
a victim memcg when global OOM is invoked. The mainly motivation is
the need to customizable OOM victim selection functionality so that
we can protect more important app from OOM killer.

This is rather modest to give an idea how the whole thing is supposed to
work. I have looked through patches very quickly but there is no overall
design described anywhere either.

Please could you give us a high level design description and reasoning
why certain decisions have been made? e.g. why is this limited to the
global oom sitation, why is the BPF program forced to operate on memcgs
as entities etc...
Also it would be very helpful to call out limitations of the BPF
program, if there are any.

One thing I realized recently: we don't have to make a victim selection
during the OOM, we [almost always] can do it in advance.

I agree. We take precautions against memory shortage on over-committed
machines through oomd-like userspace tools, to mitigate possible SLO
violations on important services. The kernel OOM-killer in our scenario
works as a last resort, since userspace tools are not that reliable.
IMHO it would be useful for kernel to provide such flexibility.


Kernel OOM's must guarantee the forward progress under heavy memory pressure
and it creates a lot of limitations on what can and what can't be done in
these circumstances.

But in practice most policies except maybe those which aim to catch very fast
memory spikes rely on things which are fairly static: a logical importance of
several workloads in comparison to some other workloads, "age", memory footprint
etc.

So I wonder if the right path is to create a kernel interface which allows
to define a OOM victim (maybe several victims, also depending on if it's
a global or a memcg oom) and update it periodically from an userspace.

Something like [1] proposed by Chuyi? IIUC there is still lack of some
triggers to invoke the procedure so we can actually do this in advance.

[1] https://lore.kernel.org/lkml/f8f44103-afba-10ee-b14b-a8e60a7f33d8@xxxxxxxxxxxxx/

Thanks & Best,
Abel