Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

From: Michal Hocko
Date: Wed Oct 22 2014 - 09:51:35 EST


On Wed 22-10-14 08:40:25, Johannes Weiner wrote:
> On Wed, Oct 22, 2014 at 01:21:16PM +0200, Michal Hocko wrote:
> > On Tue 21-10-14 14:22:39, Johannes Weiner wrote:
> > [...]
> > > From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Date: Tue, 21 Oct 2014 09:53:54 -0400
> > > Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> > > round-robin
> > >
> > > The round-robin node reclaim currently tries to include only nodes
> > > that have memory of the memcg in question, which is quite elaborate.
> > >
> > > Just use plain round-robin over the nodes that are allowed by the
> > > task's cpuset, which are the most likely to contain that memcg's
> > > memory. But even if zones without memcg memory are encountered,
> > > direct reclaim will skip over them without too much hassle.
> >
> > I do not think that using the current's node mask is correct. Different
> > tasks in the same memcg might be bound to different nodes and then a set
> > of nodes might be reclaimed much more if a particular task hits limit
> > more often. It also doesn't make much sense from semantical POV, we are
> > reclaiming memcg so the mask should be union of all tasks allowed nodes.
>
> Unless the cpuset hierarchy is separate from the memcg hierarchy, all
> tasks in the memcg belong to the same cpuset. And the whole point of
> cpusets is that a group of tasks has the same nodemask, no?

Memory limit and memory placement are orthogonal configurations, and they
can be stacked one on top of the other in either order.

> Sure, there are *possible* configurations for which this assumption
> breaks, like multiple hierarchies, but are they sensible? Do we care?

Why wouldn't they be sensible? What is wrong with limiting the memory of
a workload which internally uses node placement for its components?

> > What we do currently is overly complicated though and I agree that there
> > is no good reason for it.
> > Let's just s@cpuset_current_mems_allowed@node_online_map@ and round
> > robin over all nodes. As you said we do not have to optimize for empty
> > zones.
>
> That was what I first had. And cpuset_current_mems_allowed defaults
> to node_online_map, but once the user sets up cpusets in conjunction
> with memcgs, it seems to be the preferred value.
>
> The other end of this is that if you have 16 nodes and use cpuset to
> bind the task to node 14 and 15, round-robin iterations of node 1-13
> will reclaim the group's memory on 14 and only the 15 iteration will
> actually look at memory from node 15 first.

mem_cgroup_select_victim_node can check reclaimability of the memcg
(hierarchy) and skip nodes without pages. Or would that be too
expensive? We are in the slow path already.
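
To make the idea concrete, here is a rough userspace sketch of round-robin
victim selection that skips nodes where the group has nothing to reclaim.
This is not kernel code: struct memcg_model, node_reclaimable(),
select_victim_node() and MAX_NODES are all made up for illustration, and the
per-node counters stand in for whatever reclaimability check the real
function would use.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16	/* hypothetical, matches the 16-node example above */

/*
 * Hypothetical stand-in for a memcg: reclaimable pages per node plus the
 * node that was picked last. Not the kernel's struct mem_cgroup.
 */
struct memcg_model {
	unsigned long reclaimable[MAX_NODES];
	int last_scanned_node;
};

/* Hypothetical check: does the group have anything to reclaim on nid? */
static bool node_reclaimable(const struct memcg_model *memcg, int nid)
{
	return memcg->reclaimable[nid] > 0;
}

/*
 * Round-robin over all nodes, starting after the last picked one, and
 * return the first node that holds reclaimable pages of the group.
 * If no node qualifies, fall back to a plain round-robin step.
 */
static int select_victim_node(struct memcg_model *memcg)
{
	int start = memcg->last_scanned_node;
	int nid = start;

	assert(start >= 0 && start < MAX_NODES);

	for (int i = 0; i < MAX_NODES; i++) {
		nid = (nid + 1) % MAX_NODES;
		if (node_reclaimable(memcg, nid)) {
			memcg->last_scanned_node = nid;
			return nid;
		}
	}

	/* Nothing reclaimable anywhere: just advance by one node. */
	nid = (start + 1) % MAX_NODES;
	memcg->last_scanned_node = nid;
	return nid;
}
```

With pages only on nodes 14 and 15, successive calls in this model alternate
between 14 and 15 instead of walking the empty nodes first. The cost is one
pass over the nodes per selection, which is the "too expensive?" question.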

> It seems using the cpuset bindings, while theoretically independent,
> would do the right thing for all intents and purposes.

Only if cpuset is on top of memcg. Not the other way around, as mentioned
above (possible node over-reclaim).
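
The over-reclaim concern can be illustrated with a toy userspace model,
assuming two tasks in one memcg that are bound to disjoint node sets.
Everything here (struct task_model, struct reclaim_model, reclaim_once(),
NR_NODES) is hypothetical and only counts which node each reclaim round
would start from; it does not model actual page reclaim.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical small machine */

/* Hypothetical task: bitmask of allowed nodes, bit n stands for node n. */
struct task_model {
	unsigned int mems_allowed;
};

/* Hypothetical per-memcg state: last picked node and per-node counters. */
struct reclaim_model {
	int last_scanned_node;
	unsigned int reclaims[NR_NODES];
};

/*
 * Pick the next node in 'mask' after the last picked one (round-robin)
 * and count one reclaim round against it.
 */
static int reclaim_once(struct reclaim_model *rm, unsigned int mask)
{
	int nid = rm->last_scanned_node;

	/* The mask must contain at least one valid node. */
	assert(mask & ((1u << NR_NODES) - 1));

	do {
		nid = (nid + 1) % NR_NODES;
	} while (!(mask & (1u << nid)));

	rm->last_scanned_node = nid;
	rm->reclaims[nid]++;
	return nid;
}
```

If a task bound to nodes 0-1 hits the limit eight times and a task bound to
nodes 2-3 hits it twice, passing each task's own mask puts four rounds on
each of nodes 0 and 1 and one round on each of nodes 2 and 3. Passing a mask
of all nodes for the same eight rounds gives two rounds per node.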
--
Michal Hocko
SUSE Labs