Re: [RFC PATCH] mm, memcg: introduce memory.high.throttle
From: Balbir Singh
Date: Thu Jan 30 2025 - 17:27:32 EST
On 1/31/25 07:19, Johannes Weiner wrote:
> On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
>> On 1/30/25 11:39 AM, Johannes Weiner wrote:
>>> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
>>>> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling has
>>>>>> increased substantially. As a result, it can be difficult for a
>>>>>> misbehaving application that consumes an increasing amount of memory
>>>>>> to get OOM-killed if memory.high is set. Instead, the application may
>>>>>> just crawl along, holding close to the allowed memory.high amount of
>>>>>> memory for its memory cgroup for a very long time, especially if it
>>>>>> does a lot of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitant to
>>>>>> use memory.high. Instead, they use only memory.max for memory control,
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>>>
>>>>>> To allow better control of the amount of throttling, and hence of how
>>>>>> quickly a misbehaving task can be OOM-killed, a new single-value
>>>>>> memory.high.throttle control file is added. The allowable range is
>>>>>> 0-32. The default value of 0 means maximum throttling, as before.
>>>>>> Any positive value reduces the throttling by the corresponding power
>>>>>> of 2 and makes OOM kills happen more easily.
>>>>>>
>>>>>> System administrators can now use this parameter to control how
>>>>>> easily OOM kills happen for applications that tend to consume a lot
>>>>>> of memory, without having to run a special userspace memory
>>>>>> management tool to monitor memory consumption when memory.high is set.
>>>>>>
>>>>>> Below are the test results of a simple program showing how different
>>>>>> values of memory.high.throttle affect its run time (in secs) until
>>>>>> it gets OOM killed. This test program allocates pages from the kernel
>>>>>> continuously. There are some run-to-run variations and the results
>>>>>> are just one possible set of samples.
>>>>>>
>>>>>> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>>>>>> --wait -t timeout 300 /tmp/mmap-oom
>>>>>>
>>>>>>   memory.high.throttle    service runtime
>>>>>>   --------------------    ---------------
>>>>>>            0                  120.521
>>>>>>            1                  103.376
>>>>>>            2                   85.881
>>>>>>            3                   69.698
>>>>>>            4                   42.668
>>>>>>            5                   45.782
>>>>>>            6                   22.179
>>>>>>            7                    9.909
>>>>>>            8                    5.347
>>>>>>            9                    3.100
>>>>>>           10                    1.757
>>>>>>           11                    1.084
>>>>>>           12                    0.919
>>>>>>           13                    0.650
>>>>>>           14                    0.650
>>>>>>           15                    0.655
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>>>>>
>>>>>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>>>>>> ---
>>>>>> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>>>>> include/linux/memcontrol.h | 2 ++
>>>>>> mm/memcontrol.c | 41 +++++++++++++++++++++++++
>>>>>> 3 files changed, 57 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> index cb1b4e759b7e..df9410ad8b3b 100644
>>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>>>>> Going over the high limit never invokes the OOM killer and
>>>>>> under extreme conditions the limit may be breached. The high
>>>>>> limit should be used in scenarios where an external process
>>>>>> - monitors the limited cgroup to alleviate heavy reclaim
>>>>>> - pressure.
>>>>>> + monitors the limited cgroup to alleviate heavy reclaim pressure
>>>>>> + unless a high enough value is set in "memory.high.throttle".
>>>>>> +
>>>>>> + memory.high.throttle
>>>>>> + A read-write single value file which exists on non-root
>>>>>> + cgroups. The default is 0.
>>>>>> +
>>>>>> + Memory usage throttle control. This value controls the amount
>>>>>> + of throttling that will be applied when memory consumption
>>>>>> + exceeds the "memory.high" limit. The larger the value is,
>>>>>> + the smaller the amount of throttling will be and the easier an
>>>>>> + offending application may get OOM killed.
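(The mm/memcontrol.c hunk is not quoted above. Purely as an illustrative
sketch, the power-of-2 reduction described in the changelog could hook into
the existing memory.high penalty calculation roughly as follows; the
high_throttle field and the helper name are assumptions made for this
example, not the actual patch.)

/*
 * Illustrative sketch only -- not the actual patch.  Assumes a
 * memcg->high_throttle field (0..32) written through the new
 * memory.high.throttle file.  0 keeps today's behaviour; each
 * increment halves the sleep imposed on an allocating task once
 * memory.high reclaim falls behind.
 */
static unsigned long throttled_penalty(struct mem_cgroup *memcg,
				       unsigned long penalty_jiffies)
{
	/* hypothetical field written by memory.high.throttle */
	unsigned int shift = READ_ONCE(memcg->high_throttle);

	if (shift >= BITS_PER_LONG)
		return 0;

	/* power-of-2 reduction of the throttling delay */
	return penalty_jiffies >> shift;
}

The administrator would then pick a value based on how quickly a runaway
cgroup should be allowed to reach memory.max and get OOM-killed, as the
runtime table above illustrates.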
>>>>> memory.high is supposed to never invoke the OOM killer (see above). It's
>>>>> unclear to me if you are referring to OOM kills from the kernel or
>>>>> userspace in the commit message. If the latter, I think it shouldn't be
>>>>> in kernel docs.
>>>> I am sorry for not being clear. What I meant is that if an application
>>>> is consuming more memory than what can be recovered by memory reclaim,
>>>> it will reach memory.max faster, if set, and get OOM killed. Will
>>>> clarify that in the next version.
>>> You're not really supposed to use max and high in conjunction. One is
>>> for kernel OOM killing, the other for userspace OOM killing. That's
>>> what the documentation that you edited is trying to explain.
>>>
>>> What's the use case you have in mind?
>>
>> It's news to me that high and max are not supposed to be used
>> together. One problem with v1 is that by the time the limit is reached
>> and memory reclaim is not able to recover enough memory in time, the
>> task will be OOM killed. I always thought that by setting high a bit
>> below max, say at 90%, early memory reclaim would reduce the chance of
>> OOM kills. There are certainly others who think like that.
>
> I can't fault you or them for this, because this was the original plan
> for these knobs. However, this didn't end up working in practice.
>
> If you have a non-throttling, non-killing limit, then reclaim will
> either work and keep the workload to that limit; or it won't work, and
> the workload escapes to the hard limit and gets killed.
>
> You'll notice you get the same behavior with just memory.max set by
> itself - either reclaim can keep up, or OOM is triggered.
Yep, that was intentional; it was best effort.
>
>> So the use case here is to reduce the chance of OOM kills while not
>> letting really misbehaving tasks hold up useful memory for too long.
>
> That brings us to the idea of a medium amount of throttling.
>
> The premise would be that, by throttling *to a certain degree*, you
> can slow the workload down just enough to tide over the pressure peak
> and avert the OOM kill.
>
> This assumes that some tasks inside the cgroup can independently make
> forward progress and release memory, while allocating tasks inside the
> group are already throttled.
>
> [ Keep in mind, it's a cgroup-internal limit, so no memory freeing
> outside of the group can alleviate the situation. Progress must
> happen from within the cgroup. ]
>
> But this sort of parallelism in a pressured cgroup is unlikely in
> practice. By the time reclaim fails, usually *every task* in the
> cgroup ends up having to allocate, because they lose executables to
> cache reclaim, or heap memory to swap etc., and then page fault.
>
> We found that more often than not, it just deteriorates into a single
> sequence of events. Slowing it down just drags out the inevitable.
>
> As a result we eventually moved away from the idea of gradual
> throttling. The last remnants of this idea finally disappeared from
> the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).
>
> memory.high now effectively puts the cgroup to sleep when reclaim
> fails (similar to oom killer disabling in v1, but without the caveats
> of that implementation). This is useful to let userspace implement
> custom OOM killing policies.
>
I've found that using memory.high as a limit behaves the way you've
described: with a benchmark like STREAM, the benchmark did not finish and
was stalled for several hours when it was short of a few GBs of memory,
and I did not have a background user space process to do a user space
kill. In my case, reclaim was able to reclaim a little, so some forward
progress was made and the nr_retries limit was never hit (IIRC).
Effectively, in v1 the soft_limit was supposed to be best-effort pushback,
and the OOM killer could find a task to kill globally (in the initial
design) if there was global memory pressure.
For this discussion, adding memory.high.throttle seems like it bridges the
transition from memory.high to memory.max/OOM without external
intervention. I do feel that not killing the task just locks the task in
the memcg forever (at least in my case), and it sounds like using
memory.high requires an external process monitor to kill the task if it
does not make progress.
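To illustrate what such an external monitor might look like, here is a
rough userspace sketch (assuming cgroup v2 mounted at /sys/fs/cgroup, a
hypothetical test.slice cgroup, and a kernel recent enough to provide
cgroup.kill; the poll interval and stall threshold are arbitrary choices
for the example):

/*
 * Minimal "userspace OOM policy" sketch, for illustration only.
 * It watches the cgroup's memory.events "high" counter; if the
 * counter keeps climbing for STALLED_POLLS consecutive seconds,
 * the whole cgroup is killed via cgroup.kill.
 */
#include <stdio.h>
#include <unistd.h>

#define CGROUP		"/sys/fs/cgroup/test.slice"	/* hypothetical path */
#define STALLED_POLLS	30	/* ~30s stuck over memory.high before killing */

static unsigned long long read_high_events(void)
{
	char line[256];
	unsigned long long val = 0;
	FILE *f = fopen(CGROUP "/memory.events", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "high %llu", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long prev = read_high_events();
	int stalled = 0;

	for (;;) {
		unsigned long long cur;

		sleep(1);
		cur = read_high_events();
		/* still incrementing -> the group is stuck over memory.high */
		stalled = (cur > prev) ? stalled + 1 : 0;
		prev = cur;

		if (stalled >= STALLED_POLLS) {
			FILE *k = fopen(CGROUP "/cgroup.kill", "w");

			if (k) {
				fputs("1\n", k); /* kill every task in the cgroup */
				fclose(k);
			}
			return 0;
		}
	}
}

A real agent would probably also look at memory.pressure (PSI) or
memory.stat before deciding to kill, but the point stands that the policy
has to live in userspace once memory.high stops short of killing.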
Warm Regards,
Balbir Singh