[PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution

From: Johannes Weiner
Date: Fri Dec 13 2019 - 14:22:16 EST


When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to each siblings's utilized protection:

low_usage = min(low, usage)
elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

A/memory.low = 10G
A/A1/memory.low = 10G, A/memory.current = 20G
A/A2/memory.low = 10G, B/memory.current = 20G
A/A3/memory.low = 10G, C/memory.current = 8G

siblings_low_usage = 8G (only A3 contributes)
A1/elow = parent_elow(10G) * low_usage(20G) / siblings_low_usage(8G) = 25G

The 25G are then capped to A1's own memory.low setting, i.e. 10G. The
same is true for A2. And A3 would also receive 10G. The combined
protection of A1, A2 and A3 is 30G, when A limits the tree to 10G.

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.

When reclaim is applied in a binary fashion - cgroup is reclaimed when
it's above its protection, otherwise it's skipped - this could work
actually work out just fine - although it's not quite clear to me why
we'd introduce this error in the first place. However, since
1bc63fb1272b ("mm, memcg: make scan aggression always exclude
protection"), reclaim pressure is scaled to how much a cgroup is above
its protection. As a result this calculation error unduly skews
pressure away from A1 and A2 toward the rest of the system.

Fix this by by making siblings_low_usage the sum of all protected
memory among siblings, including those that are in excess of their
protection.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
mm/memcontrol.c | 4 +---
mm/page_counter.c | 12 ++----------
2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
* elow = min( memory.low, parent->elow * ------------------ ),
* siblings_low_usage
*
- * | memory.current, if memory.current < memory.low
- * low_usage = |
- * | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
*
*
* Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
return;

if (c->min || atomic_long_read(&c->min_usage)) {
- if (usage <= c->min)
- protected = usage;
- else
- protected = 0;
-
+ protected = min(usage, c->min);
old_protected = atomic_long_xchg(&c->min_usage, protected);
delta = protected - old_protected;
if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
}

if (c->low || atomic_long_read(&c->low_usage)) {
- if (usage <= c->low)
- protected = usage;
- else
- protected = 0;
-
+ protected = min(usage, c->low);
old_protected = atomic_long_xchg(&c->low_usage, protected);
delta = protected - old_protected;
if (delta)
--
2.24.0