Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

From: Kamezawa Hiroyuki
Date: Mon Dec 14 2015 - 22:23:27 EST


On 2015/12/15 0:30, Michal Hocko wrote:
On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:
In the legacy hierarchy we charge memsw, which is dubious, because:

- memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

- memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.
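
To make the second point concrete, here is a minimal userspace model of the accounting described above. It is only a sketch and not kernel code: the struct and function names are hypothetical, and it assumes a page falls into exactly one of three classes (in memory only, in swap cache with a swap entry, or swapped out).

```c
/* Hypothetical model, not kernel code: page counts for one cgroup. */
struct memsw_model {
	long resident_only;	/* pages in memory with no swap entry */
	long swapcache;		/* pages in memory that also hold a swap entry */
	long swapped_out;	/* swap entries whose page is no longer in memory */
};

/* memory.usage: every page currently in memory. */
static long memory_usage(const struct memsw_model *m)
{
	return m->resident_only + m->swapcache;
}

/* Number of swap entries actually in use. */
static long swap_usage(const struct memsw_model *m)
{
	return m->swapcache + m->swapped_out;
}

/* memsw.usage: a swap cache page with a swap entry is charged only once. */
static long memsw_usage(const struct memsw_model *m)
{
	return m->resident_only + m->swapcache + m->swapped_out;
}
```

With memory.limit = 100 and memsw.limit = 150 (in pages), a cgroup holding 100 swap cache pages and 50 swapped-out pages stays within both limits in this model, yet it occupies 100 memory pages and 150 swap entries at the same time.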

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.
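
As a rough illustration of what a separate swap counter means, the sketch below charges swap entries against their own limit, independent of memory usage. It is an assumption-laden toy: the names are hypothetical, and it models neither the cgroup hierarchy nor concurrency, both of which a real kernel counter has to handle.

```c
#include <stdbool.h>

/* Hypothetical per-cgroup swap counter, in units of swap entries. */
struct swap_counter {
	unsigned long usage;	/* entries currently charged */
	unsigned long max;	/* limit on charged entries */
};

/* Charge nr entries; refuse if that would push usage above max. */
static bool swap_try_charge(struct swap_counter *c, unsigned long nr)
{
	/* Written to avoid overflow in usage + nr. */
	if (nr > c->max || c->usage > c->max - nr)
		return false;
	c->usage += nr;
	return true;
}

/* Release nr entries, clamping at zero. */
static void swap_uncharge(struct swap_counter *c, unsigned long nr)
{
	c->usage = nr > c->usage ? 0 : c->usage - nr;
}
```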

I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame meant when he said
anon would basically become mlocked. This would be even more of an issue
with resource delegation to sub-hierarchies because nothing prevents
setting the swap amount to a small value and using that as anon memory
protection.

I guess this was the reason why this approach hasn't been chosen before

Yes. At that time, "never break global VM" was the policy. And "mlock" can be
used to attack the system.

but I think we can come up with a way to stop the runaway consumption
even when the swap is accounted separately. All of them are quite nasty
but let me try.

We could allow charges to fail even for the high limit if the excess is
way above the amount of reclaimable memory in the given memcg/hierarchy.
A runaway load would be stopped before it can cause considerable
damage outside of its hierarchy this way even when the swap limit
is configured small.
Now that goes against the high limit semantic which should only throttle
the consumer and shouldn't cause any functional failures but maybe this
is acceptable for the overall system stability. An alternative would
be to throttle in the high limit reclaim context proportionally to
the excess. This is normally done by the reclaim itself but with no
reclaimable memory this wouldn't work that way.
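
The two alternatives in the paragraph above could be sketched roughly as follows. This is only an illustration of the idea, not a proposed implementation: the function names, EXCESS_FACTOR and the throttle constants are all made-up assumptions, and the numbers are arbitrary.

```c
#include <stdbool.h>

/* Hypothetical tunables, chosen arbitrarily for illustration. */
#define EXCESS_FACTOR		4	/* "way above" = more than 4x reclaimable */
#define BASE_THROTTLE_MS	100U	/* delay when the excess equals the high limit */
#define MAX_THROTTLE_MS		2000U	/* upper bound on a single delay */

/* Option 1: fail the charge if the excess dwarfs what can be reclaimed. */
static bool high_charge_should_fail(unsigned long usage, unsigned long high,
				    unsigned long reclaimable)
{
	if (usage <= high)
		return false;
	/* Divide instead of multiplying to avoid overflow. */
	return (usage - high) / EXCESS_FACTOR > reclaimable;
}

/* Option 2: throttle the consumer proportionally to the excess. */
static unsigned int high_throttle_ms(unsigned long usage, unsigned long high)
{
	unsigned long long ms;

	if (usage <= high)
		return 0;
	if (high == 0)
		return MAX_THROTTLE_MS;
	ms = (unsigned long long)(usage - high) * BASE_THROTTLE_MS / high;
	return ms > MAX_THROTTLE_MS ? MAX_THROTTLE_MS : (unsigned int)ms;
}
```

Option 1 changes the high limit from a throttle into something that can fail, while option 2 keeps the semantic but only slows the consumer down.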

This seems hard to use for users who want to control resources precisely,
even if it is good for stability.

Another option would be to ignore the swap limit during the global
reclaim. This wouldn't stop the runaway loads but they would at least
see their fair share of the reclaim. The swap excess could be then used
as a "handicap" for a more aggressive throttling during high limit reclaim
or to trigger hard limit sooner.

This seems to work. But users need to understand that the swap limit can be exceeded.

Or we could teach the global OOM killer to select abusive anon memory
users with restricted swap. That would require iterating through all
memcgs, checking whether their anon consumption is in large excess of
their swap limit, and falling back to the memcg OOM victim selection if
that is the case. This adds more complexity to the OOM killer path so I am
not sure this is generally acceptable, though.
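
For reference, the check described above might look something like the sketch below. Everything here is hypothetical: struct memcg_stat, ANON_EXCESS_FACTOR and the flat array stand in for whatever the kernel would really walk, and "large excess" is arbitrarily taken to mean anon usage above a multiple of the swap limit.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical snapshot of one memcg, in pages. */
struct memcg_stat {
	const char *name;
	unsigned long anon;	/* anonymous memory in use */
	unsigned long swap_max;	/* swap limit; ULONG_MAX means unlimited */
};

/* Hypothetical threshold: anon above 8x the swap limit counts as abusive. */
#define ANON_EXCESS_FACTOR 8

/* Return the memcg with the largest anon excess, or NULL if none qualifies. */
static const struct memcg_stat *
pick_abusive_memcg(const struct memcg_stat *cgs, size_t n)
{
	const struct memcg_stat *worst = NULL;
	unsigned long worst_excess = 0;
	unsigned long threshold, excess;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Skip limits so large that the threshold would overflow. */
		if (cgs[i].swap_max > ULONG_MAX / ANON_EXCESS_FACTOR)
			continue;
		threshold = cgs[i].swap_max * ANON_EXCESS_FACTOR;
		if (cgs[i].anon <= threshold)
			continue;
		excess = cgs[i].anon - threshold;
		if (!worst || excess > worst_excess) {
			worst = &cgs[i];
			worst_excess = excess;
		}
	}
	return worst;
}
```

A NULL result would mean no memcg stands out, so the regular global victim selection applies.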


I think this is not acceptable.

My question now is: is the knob usable/useful even without additional
heuristics? Do we want to protect swap space so rigidly that a swap
limited memcg can cause bigger problems than without the swap limit
globally?


Swap requires some limit. If there is none, an application can eat up all
swap, and it will never be freed until the application accesses it or
swapoff runs.

Thanks,
-Kame

The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.
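
A userspace monitor could read these two files along the following lines. This is a sketch under assumptions: the helper names are hypothetical, each file is assumed to hold a single decimal value, and memory.swap.max is assumed to read "max" when no limit is set, as other cgroup2 limit files do. The cgroup mount point is not specified here and has to be supplied by the caller.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "<number>\n" or "max\n"; "max" is reported as ULLONG_MAX. */
static int parse_cgroup_value(const char *s, unsigned long long *out)
{
	char *end;

	if (strncmp(s, "max", 3) == 0 && (s[3] == '\n' || s[3] == '\0')) {
		*out = ULLONG_MAX;
		return 0;
	}
	/* Reject signs, whitespace and empty input up front. */
	if (s[0] < '0' || s[0] > '9')
		return -1;
	errno = 0;
	*out = strtoull(s, &end, 10);
	if (errno || (*end != '\n' && *end != '\0'))
		return -1;
	return 0;
}

/* Read one value from a file such as <cgroup dir>/memory.swap.current. */
static int read_cgroup_value(const char *path, unsigned long long *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_cgroup_value(buf, out);
}
```

Both helpers return 0 on success and -1 on any error, so a caller can compare the value read from memory.swap.current against the one from memory.swap.max.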

Signed-off-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
---
include/linux/memcontrol.h | 1 +
include/linux/swap.h | 5 ++
mm/memcontrol.c | 123 +++++++++++++++++++++++++++++++++++++++++----
mm/shmem.c | 4 ++
mm/swap_state.c | 5 ++
5 files changed, 129 insertions(+), 9 deletions(-)

[...]


