Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop

From: Alexey Avramov
Date: Mon Apr 05 2021 - 18:08:44 EST


> Under high system memory and load pressure, we ran the LTP test suite
> and found that the system got stuck; direct memory reclaim was all
> stuck in io_schedule.

> As this is our first time dealing with the swap part, we have not
> found a good way to fix the problem.

The solution is to protect clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only
> way to free memory is to reclaim pages from the file list. This
> results in a lot of thrashing under low memory conditions. We see
> the system become unresponsive for minutes before it eventually OOMs.
> We also see very slow browser tab switching under low memory. Instead
> of an unresponsive system, we'd really like the kernel to OOM as soon
> as it starts to thrash. If it can't keep the working set in memory,
> then OOM. Losing one of many tabs is a better behaviour for the user
> than an unresponsive system.

> This patch creates a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when there are less than min_filelist_kbytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more important
> than not OOMing.

> With this patch and min_filelist_kbytes set to 50000, I see very little block
> layer activity during low memory. The system stays responsive under low
> memory and browser tab switching is fast. Eventually, a process gets killed
> by OOM. Without this patch, the system gets wedged for minutes before it
> eventually OOMs.

https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects:
- Improved system responsiveness under low-memory conditions;
- Improved performance in I/O-bound tasks under memory pressure;
- The OOM killer triggers sooner (with hard protection);
- Fast system recovery after an OOM kill.
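
The two knobs the patch below adds can be set like any other vm sysctl.
A minimal usage sketch (the values are illustrative, not tuned
recommendations):

    # best-effort protection of ~256 MiB of clean file cache
    sysctl vm.clean_low_kbytes=262144
    # hard protection of ~64 MiB; setting this too high risks early OOM
    sysctl vm.clean_min_kbytes=65536

To persist the settings, put the same keys into /etc/sysctl.conf or a
file under /etc/sysctl.d/.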

Read more: https://github.com/hakavlad/le9-patch
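
To get a rough system-wide estimate of the quantity the patch compares
against these thresholds, /proc/meminfo can be used (the patch itself
works per-node and also counts isolated file pages, which meminfo does
not expose, so this is only an approximation):

    awk '/^(Active|Inactive)\(file\)/ { kb += $2 }
         /^Dirty:/ { kb -= $2 }
         END { print kb " kB of approximately clean file cache" }' /proc/meminfo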

The patch:

From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov <hakavlad@xxxxxxxx>
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of CFP is required by userspace for
normal operation; above all, it needs the cache of shared libraries and
executable files. If the volume of the CFP cache falls below a certain
level, thrashing and even livelock occur.

Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency
and prevent livelock in near-OOM conditions. The patch provides sysctl
knobs for protecting the specified amount of clean file cache under memory
pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed under memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to
OOM, have no swap space, or vm.swappiness=0. Setting it to a high value
may result in an early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in an early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).

Reported-by: Artem S. Tashkinov <aros@xxxxxxx>
Signed-off-by: Alexey Avramov <hakavlad@xxxxxxxx>
---
 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++
 include/linux/mm.h                      |  3 ++
 kernel/sysctl.c                         | 14 ++++++++
 mm/Kconfig                              | 35 +++++++++++++++++++
 mm/vmscan.c                             | 59 +++++++++++++++++++++++++++++++++
 5 files changed, 148 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f455fa00c..5d5ddfc85 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
+- clean_low_kbytes
+- clean_min_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


+clean_low_kbytes
+================
+
+This knob provides *best-effort* protection of clean file pages. The clean file
+pages on the current node won't be reclaimed under memory pressure when their
+volume is below vm.clean_low_kbytes *unless* we threaten to OOM, have no
+swap space, or vm.swappiness=0.
+
+Protection of clean file pages may be used to prevent thrashing and
+reduce I/O under low-memory conditions.
+
+Setting it to a high value may result in an early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
+
+
+clean_min_kbytes
+================
+
+This knob provides *hard* protection of clean file pages. The clean file pages
+on the current node won't be reclaimed under memory pressure when their volume
+is below vm.clean_min_kbytes.
+
+Hard protection of clean file pages may be used to avoid high latency and
+prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in an early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
+
+
compact_memory
==============

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3f..7799f1555 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -202,6 +202,9 @@ static inline void __mm_zero_struct_page(struct page *page)

extern int sysctl_max_map_count;

+extern unsigned long sysctl_clean_low_kbytes;
+extern unsigned long sysctl_clean_min_kbytes;
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad08596..854b311cd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3083,6 +3083,20 @@ static struct ctl_table vm_table[] = {
},
#endif
{
+ .procname = "clean_low_kbytes",
+ .data = &sysctl_clean_low_kbytes,
+ .maxlen = sizeof(sysctl_clean_low_kbytes),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+ {
+ .procname = "clean_min_kbytes",
+ .data = &sysctl_clean_min_kbytes,
+ .maxlen = sizeof(sysctl_clean_min_kbytes),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+ {
.procname = "user_reserve_kbytes",
.data = &sysctl_user_reserve_kbytes,
.maxlen = sizeof(sysctl_user_reserve_kbytes),
diff --git a/mm/Kconfig b/mm/Kconfig
index 390165ffb..3915c71e1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.

+config CLEAN_LOW_KBYTES
+ int "Default value for vm.clean_low_kbytes"
+ depends on SYSCTL
+ default "0"
+ help
+ The vm.clean_low_kbytes sysctl knob provides *best-effort*
+ protection of clean file pages. The clean file pages on the current
+ node won't be reclaimed under memory pressure when their volume is
+ below vm.clean_low_kbytes *unless* we threaten to OOM, have no
+ swap space, or vm.swappiness=0.
+
+ Protection of clean file pages may be used to prevent thrashing and
+ reduce I/O under low-memory conditions.
+
+ Setting it to a high value may result in an early eviction of anonymous
+ pages into the swap space by attempting to hold the protected amount of
+ clean file pages in memory.
+
+config CLEAN_MIN_KBYTES
+ int "Default value for vm.clean_min_kbytes"
+ depends on SYSCTL
+ default "0"
+ help
+ The vm.clean_min_kbytes sysctl knob provides *hard* protection
+ of clean file pages. The clean file pages on the current node won't be
+ reclaimed under memory pressure when their volume is below
+ vm.clean_min_kbytes.
+
+ Hard protection of clean file pages may be used to avoid high latency and
+ prevent livelock in near-OOM conditions.
+
+ Setting it to a high value may result in an early out-of-memory condition
+ due to the inability to reclaim the protected amount of clean file pages
+ when other types of pages cannot be reclaimed.
+
config HAVE_MEMBLOCK_PHYS_MAP
bool

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac..77e98c43e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -120,6 +120,19 @@ struct scan_control {
/* The file pages on the current node are dangerously low */
unsigned int file_is_tiny:1;

+ /*
+ * The clean file pages on the current node won't be reclaimed when
+ * their volume is below vm.clean_low_kbytes *unless* we threaten
+ * to OOM or have no swap space or vm.swappiness=0.
+ */
+ unsigned int clean_below_low:1;
+
+ /*
+ * The clean file pages on the current node won't be reclaimed when
+ * their volume is below vm.clean_min_kbytes.
+ */
+ unsigned int clean_below_min:1;
+
/* Allocation order */
s8 order;

@@ -166,6 +179,17 @@ struct scan_control {
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif

+#if CONFIG_CLEAN_LOW_KBYTES < 0
+#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
+#endif
+
+#if CONFIG_CLEAN_MIN_KBYTES < 0
+#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
+#endif
+
+unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
+unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
+
/*
* From 0 .. 200. Higher means more swappy.
*/
@@ -2283,6 +2307,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
}

/*
+ * Force-scan anon if the volume of clean file pages is below
+ * vm.clean_min_kbytes or vm.clean_low_kbytes (unless the
+ * swappiness setting disagrees with swapping).
+ */
+ if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
+ scan_balance = SCAN_ANON;
+ goto out;
+ }
+
+ /*
* If there is enough inactive page cache, we do not reclaim
* anything from the anonymous working right now.
*/
@@ -2418,6 +2452,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
BUG();
}

+ /*
+ * Don't reclaim clean file pages when their volume is below
+ * vm.clean_min_kbytes.
+ */
+ if (file && sc->clean_below_min)
+ scan = 0;
+
nr[lru] = scan;
}
}
@@ -2768,6 +2809,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
anon >> sc->priority;
}

+ if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
+ unsigned long reclaimable_file, dirty, clean = 0;
+
+ reclaimable_file =
+ node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE) +
+ node_page_state(pgdat, NR_ISOLATED_FILE);
+ dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+ if (reclaimable_file > dirty)
+ clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
+
+ sc->clean_below_low = clean < sysctl_clean_low_kbytes;
+ sc->clean_below_min = clean < sysctl_clean_min_kbytes;
+ } else {
+ sc->clean_below_low = false;
+ sc->clean_below_min = false;
+ }
+
shrink_node_memcgs(pgdat, sc);

if (reclaim_state) {
--
2.11.0