Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

From: Pintu Agarwal
Date: Fri Aug 09 2019 - 10:19:00 EST


[...]

Hi,

This is an interesting topic for me, so I would like to join the conversation.
I would be glad to help here, either by testing PSI or by verifying
some scenarios and observations.

I have some experience working with low-memory embedded devices: RAM
as low as 128MB or 256MB, mostly less than 1GB, with or without
display/DRM/graphics support, and with ZRAM as swap space configured
at 25% of the RAM size.
The eMMC storage is also as small as 4GB, or 8GB at most.

So, I have run into these sluggishness, hang, and OOM-kill issues
quite a number of times, and I would like to share my experience and
observations here.

Recently, I have been exploring the PSI feature on my ARM QEMU and
BeagleBone environments, so I can share some feedback on that as well.

System sluggishness can result from four types of pressure (especially
on smartphone devices):
* memory allocation pressure
* I/O pressure
* Scheduling pressure
* Network pressure

I think the topic of concern here is memory pressure, so I would like
to share some thoughts about it.

* In my opinion, memory pressure should be internal to the system and
not visible to end users.
* The pressure metrics can vary from system to system, so it is
difficult to apply a single policy.
* I guess this is the time to apply "Machine Learning" and "Artificial
Intelligence" to the system :)

* Memory pressure starts with how often and how quickly the system
enters the allocation slow-path.
Thus, monitoring the slow-path may give some clue about pressure
building up in the system, which is why I used to use a slow-path
counter (see the sketch below).
Hitting the slow-path heavily right from the start indicates that the
system needs to be re-designed.
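
As a concrete example, a rough user-space proxy for such a slow-path
counter already exists: the allocstall events in /proc/vmstat count
direct-reclaim entries, i.e. allocations that fell into the slow-path.
A minimal sketch, assuming the per-zone allocstall_* naming used on
newer kernels (older kernels expose a single "allocstall" line):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/vmstat", "r");
        char name[64];
        unsigned long val, total = 0;

        if (!f) {
            perror("/proc/vmstat");
            return 1;
        }
        /* sum "allocstall" (old) or "allocstall_*" (new) counters */
        while (fscanf(f, "%63s %lu", name, &val) == 2)
            if (strncmp(name, "allocstall", 10) == 0)
                total += val;
        fclose(f);
        printf("slow-path (direct reclaim) entries: %lu\n", total);
        return 0;
    }

Sampling this periodically gives the rate at which the system enters
the slow-path, without any kernel change.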

* The system should be prevented from entering the slow-path again and
again, thus avoiding pressure.
If this does happen, it is time to reclaim memory in large chunks
rather than in small ones.
Maybe it is time to think about a shrink_all_memory() knob in the
kernel; it could be run as bottom-half processing, maybe from cgroups
(a rough sketch follows).
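
To make this concrete, below is a very rough, untested sketch of what
such a knob could look like. shrink_all_memory() is an existing helper
in mm/vmscan.c (currently only built for hibernation); the
/proc/sys/vm/shrink_memory sysctl and its handler here are purely
hypothetical:

    /* hypothetical knob: writing N to /proc/sys/vm/shrink_memory
     * reclaims up to N pages in one large chunk */
    static int sysctl_shrink_memory;

    static int shrink_memory_handler(struct ctl_table *table, int write,
                                     void __user *buffer, size_t *lenp,
                                     loff_t *ppos)
    {
        int ret = proc_dointvec(table, write, buffer, lenp, ppos);

        if (!ret && write && sysctl_shrink_memory > 0)
            shrink_all_memory(sysctl_shrink_memory);

        return ret;
    }

Whether to drive this from a bottom-half, a cgroup event or a PSI
trigger is exactly the policy question raised above.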

* Some experiments were done in the past; interested people can check this paper:
http://events17.linuxfoundation.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf

* The system is already behaving sluggishly even before it enters the
oom-kill stage.
Most of the time the oom stage is skipped, never reached, or the
system just loops around it.
Thus, some kind of oom monitoring may help gather evidence.
That is why I proposed something called an oom-stall counter: it
counts the system entering the oom path, though not necessarily an
actual oom-kill.
If this counter is increasing, we assume the system has started to
behave sluggishly.

* An oom-kill counter can also help in determining how much killing
is happening in kernel space.
Example: PSI pressure is building up but this counter is not updating...
In any case, system daemons should be protected from being killed.
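
As a side note, recent kernels already export an oom_kill event in
/proc/vmstat (added around v4.13, if I remember correctly), so the
"how much killing is happening in kernel space" part can be watched
today. A small sketch that reports its rate:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* read the "oom_kill" event counter from /proc/vmstat (0 if absent) */
    static unsigned long read_oom_kill(void)
    {
        FILE *f = fopen("/proc/vmstat", "r");
        char name[64];
        unsigned long val, count = 0;

        if (!f)
            return 0;
        while (fscanf(f, "%63s %lu", name, &val) == 2) {
            if (strcmp(name, "oom_kill") == 0) {
                count = val;
                break;
            }
        }
        fclose(f);
        return count;
    }

    int main(void)
    {
        unsigned long prev = read_oom_kill();

        while (1) {
            unsigned long cur;

            sleep(10);
            cur = read_oom_kill();
            if (cur > prev)
                printf("%lu kernel oom-kills in last 10s\n", cur - prev);
            prev = cur;
        }
        return 0;
    }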

* Some of the killing policy should be left to user space, so a
standard system daemon (or kthread) should be designed along these
lines.
It should be configured dynamically based on the system and the oom-score.
From my previous experience, in Tizen we used something called the
resourced daemon (a simplified sketch of such a policy follows the link):
https://git.tizen.org/cgit/platform/core/system/resourced/tree/src/memory?h=tizen
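
For illustration, here is a stripped-down sketch of the core of such a
user-space policy: pick the task with the highest /proc/<pid>/oom_score
and kill it. A real daemon such as resourced (or Android's lmkd) would
additionally protect system daemons, honour per-cgroup thresholds and
be driven by pressure events instead of running once:

    #include <ctype.h>
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        long worst = -1;
        int victim = -1;

        if (!proc)
            return 1;
        while ((de = readdir(proc)) != NULL) {
            char path[64];
            FILE *f;
            long score;
            int pid;

            if (!isdigit((unsigned char)de->d_name[0]))
                continue;
            pid = atoi(de->d_name);
            snprintf(path, sizeof(path), "/proc/%d/oom_score", pid);
            f = fopen(path, "r");
            if (!f)
                continue;
            if (fscanf(f, "%ld", &score) == 1 && score > worst) {
                worst = score;
                victim = pid;
            }
            fclose(f);
        }
        closedir(proc);

        if (victim > 0) {
            printf("killing pid %d (oom_score %ld)\n", victim, worst);
            kill(victim, SIGKILL);
        }
        return 0;
    }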

* Instead of a static policy there should be something called a
"Dynamic Low Memory Manager" (DLMM) policy.
That is, at every stage (slow-path, swapping, compaction failure,
reclaim failure, oom) some action can be taken.
Earlier these events were triggered using vmpressure, but now they can
be replaced with PSI triggers (see the example below).
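
With PSI triggers (available since v5.2), the user-space side of such
a DLMM can sleep until a memory-stall threshold is crossed instead of
reacting to vmpressure notifications. A minimal sketch along the lines
of the example in Documentation/accounting/psi.rst; the "150ms stall
within a 1s window" threshold is only an example value:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char trig[] = "some 150000 1000000";   /* 150ms per 1s */
        struct pollfd fds;

        fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fds.fd < 0) {
            perror("open /proc/pressure/memory");
            return 1;
        }
        if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
            perror("write trigger");
            return 1;
        }
        fds.events = POLLPRI;

        while (1) {
            if (poll(&fds, 1, -1) < 0) {
                perror("poll");
                return 1;
            }
            if (fds.revents & POLLERR) {
                fprintf(stderr, "trigger went away\n");
                return 1;
            }
            if (fds.revents & POLLPRI)
                printf("memory pressure event: run the DLMM action here\n");
        }
        return 0;
    }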

* Another major culprit behind sluggishness in the long run is system
daemons occupying all of the swap space and never releasing it.
So, even if applications are killed due to oom, it may not help much,
since the daemons will never be killed.
That is why I proposed something called "Dynamic Swappiness", where
the swappiness of daemons can be lowered dynamically while normal
applications keep higher values (a per-cgroup sketch follows).
I have done several experiments on this in the past, and I will be
publishing a paper on it soon.
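
With cgroup v1, the mechanism for the static part already exists as
per-cgroup memory.swappiness; the "dynamic" part is user-space policy
deciding when to change the values. A small sketch, where the cgroup
names (system-daemons, apps) and the values are just examples:

    #include <stdio.h>

    /* write a swappiness value into a memory cgroup (cgroup v1) */
    static int set_swappiness(const char *cgroup, int value)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.swappiness", cgroup);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* daemons: swap out reluctantly; apps: swap out normally */
        set_swappiness("/sys/fs/cgroup/memory/system-daemons", 10);
        set_swappiness("/sys/fs/cgroup/memory/apps", 100);
        return 0;
    }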

* Maybe it would help our understanding to start from a very minimal
scale (just 64MB to 512MB of RAM) with a BusyBox userspace.
If we can tune this perfectly, then larger-scale systems should
automatically have far fewer issues.

With respect to PSI, here are my observations:
* The PSI memory averaging windows (10s, 60s, 300s) are too long for
an embedded system.
I think these settings should be dynamic or user-configurable, or
there should be one more entry for 1s or less.
* The PSI memory averages get updated only after the kernel oom-kill
has already happened, which means the sluggishness has already occurred.
So, I have to utilize the "total" field and monitor the difference manually:
if the difference between the previous total and the next total is
more than 100ms and rising, then we suspect OOM (see the sketch after
this list).
* Currently, PSI values are system-wide. That is, after sluggishness
has occurred, it is difficult to tell which task caused it.
So, I was thinking of adding a new entry to capture task details as well.
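
For reference, this is roughly how I monitor the "total" field: sample
/proc/pressure/memory every second and watch how fast the "some" total
stall time (in microseconds) grows. The 100ms-per-second threshold is
just the heuristic mentioned above, not a recommendation:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* total= value (us) from the "some" line of /proc/pressure/memory */
    static unsigned long long read_some_total(void)
    {
        FILE *f = fopen("/proc/pressure/memory", "r");
        char line[256];
        unsigned long long total = 0;

        if (!f)
            return 0;
        while (fgets(line, sizeof(line), f)) {
            char *p = strstr(line, "total=");

            if (strncmp(line, "some", 4) == 0 && p) {
                sscanf(p, "total=%llu", &total);
                break;
            }
        }
        fclose(f);
        return total;
    }

    int main(void)
    {
        unsigned long long prev = read_some_total();

        while (1) {
            unsigned long long cur;

            sleep(1);
            cur = read_some_total();
            if (cur - prev > 100000)    /* >100ms of stall in 1s */
                printf("memory pressure rising: +%llu us\n", cur - prev);
            prev = cur;
        }
        return 0;
    }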


These are some of my opinions; they may or may not be directly applicable.
Further brainstorming or discussion might be required.



Regards,
Pintu