Subject: [PATCH][RFC] mm: make working set portion that is protectedtunable v2

From: Christian Ehrhardt
Date: Mon Apr 26 2010 - 06:59:54 EST


Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx>

*updates in v2*
- use do_div

This patch creates a knob to help users that have workloads suffering from the
fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides the tuning mechanisms for other users that want an even bigger
working set to be protected.

To be honest the best solution would be to allow a system not using the working
set to regain that memory *somewhen*, and therefore without drawbacks to the
scenarios it was implemented for e.g. UI interactivity while copying a lot of
data. But up to now there was no idea how to get that behaviour implemented.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
"In the worse scenario, it could waste half the memory that could
otherwise be used for readahead buffer and to prevent thrashing, in a
server serving large datasets that are hardly reused, but still slowly
builds up its active list during the long uptime (think about a slowly
performance downgrade that can be fixed by a crude dropcache action).

That said, the actual performance degradation could be much smaller -
say 15% - all memories are not equal."

We now identified a case with up to -60% Throughput, therefore this patch tries
to provide a more gentle interface than drop_caches to help a system stuck in
this.

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0% - like a kernel pre 56e49d21
- x% - allow customizing the system to someones needs

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx>
---

[diffstat]
Documentation/sysctl/vm.txt | 10 ++++++++++
include/linux/mm.h | 2 ++
kernel/sysctl.c | 9 +++++++++
mm/memcontrol.c | 9 ++++++---
mm/vmscan.c | 17 ++++++++++++++---
5 files changed, 41 insertions(+), 6 deletions(-)

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt 2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt 2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@

Currently, these files are in /proc/sys/vm:

+- active_inactive_ratio
- block_dump
- dirty_background_bytes
- dirty_background_ratio
@@ -57,6 +58,15 @@

==============================================================

+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
block_dump

block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c 2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
.extra2 = &one,
},
#endif
+ {
+ .procname = "active_inactive_ratio",
+ .data = &sysctl_active_inactive_ratio,
+ .maxlen = sizeof(sysctl_active_inactive_ratio),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },

/*
* NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c 2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c 2010-04-26 12:45:46.000000000 +0200
@@ -893,12 +893,15 @@
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
{
unsigned long active;
- unsigned long inactive;
+ unsigned long activetoprotect;

- inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+ activetoprotect = active
+ + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE)
+ * sysctl_active_inactive_ratio;
+ activetoprotect = do_div(activetoprotect, 100);

- return (active > inactive);
+ return (active > activetoprotect);
}

unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c 2010-04-26 12:50:47.000000000 +0200
@@ -1459,14 +1459,25 @@
return low;
}

+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
static int inactive_file_is_low_global(struct zone *zone)
{
- unsigned long active, inactive;
+ unsigned long active, activetoprotect;

active = zone_page_state(zone, NR_ACTIVE_FILE);
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+ activetoprotect = zone_page_state(zone, NR_FILE)
+ * sysctl_active_inactive_ratio;
+ activetoprotect = do_div(activetoprotect, 100);
+
+ return (active > activetoprotect);

- return (active > inactive);
}

/**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h 2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@

extern void dump_page(struct page *page);

+extern int sysctl_active_inactive_ratio;
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/