Re: memcg reclaim demotion wrt. isolation

From: Johannes Weiner
Date: Wed Dec 14 2022 - 07:42:55 EST


On Wed, Dec 14, 2022 at 10:42:56AM +0100, Michal Hocko wrote:
> On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that pages allocated for demotion targets
> > > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> > > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> > > pages during reclaim"). I suspect the intention is to trigger the aging
> > > on the fallback node and either drop or further demote oldest pages.
> > >
> > > This makes sense but I suspect that this wasn't intended also for
> > > memcg triggered reclaim. This would mean that a memory pressure in one
> > > hierarchy could trigger paging out pages of a different hierarchy if the
> > > demotion target is close to full.
> >
> > This is also true if you don't do demotion. If a cgroup tries to
> > allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> > enter global reclaim directly which may push out the memory of other
> > cgroups, regardless of the respective cgroup limits.
>
> You are right on this. But this is describing a slightly different
> situation IMO.
>
> > The demotion allocations don't strike me as any different. They're
> > just allocations on behalf of a cgroup. I would expect them to wake
> > kswapd and reclaim physical memory as needed.
>
> I am not sure this is an expected behavior. Consider the currently
> discussed memory.demote interface when the userspace can trigger
> (almost) arbitrary demotions. This can deplete fallback nodes without
> over-committing the memory overall yet push out demoted memory from
> other workloads. From the user POV it would look like a reclaim while
> the overall memory is far from depleted so it would be considered
> premature and warrant a bug report.
>
> The reclaim behavior would make more sense to me if it was constrained
> to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> disrupted.

What if the second tier is full, and the memcg you're trying to demote
doesn't have any pages to vacate on that tier yet? Will it fail to
demote?

Does that mean that a shared second tier node is only usable for the
cgroup that demotes to it first? And demotion stops for everybody else
until that cgroup vacates the node voluntarily?

As you can see, these would be unprecedented and quite surprising
first-come-first-served memory protection semantics.

The only way to prevent cgroups from disrupting each other on NUMA
nodes is NUMA constraints. Cgroup per-node limits. That shields not
only from demotion, but also from DoS-mbinding, or aggressive
promotion. All of these can result in some form of premature
reclaim/demotion; proactive demotion isn't special in that way.

The default behavior for cgroups is that without limits or
protections, resource access is unconstrained and competitive. Without
NUMA constraints, it's very much expected that cgroups compete over
nodes, and that the hottest pages win out. Per aging rules, freshly
demoted pages are hotter than anything else on the target node, so they
should displace accordingly.
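
To make the difference between the two policies concrete, here is a toy
model. It is not kernel code: Page, Node and demote() are made-up names,
"age" stands in for LRU position, and a node is just a fixed-size list.
It only sketches the argument above, under those assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    owner: str  # hypothetical cgroup name
    age: int    # larger means colder (longer since last reference)


@dataclass
class Node:
    capacity: int
    pages: list = field(default_factory=list)

    def demote(self, page, constrained):
        """Try to place `page` on this node; return True on success."""
        if len(self.pages) >= self.capacity:
            # Eviction candidates: only the demoting cgroup's own pages
            # under constrained reclaim, any page under global reclaim.
            victims = [p for p in self.pages
                       if not constrained or p.owner == page.owner]
            if not victims:
                return False  # nothing we are allowed to vacate
            # Evict the coldest candidate to make room.
            self.pages.remove(max(victims, key=lambda p: p.age))
        self.pages.append(page)
        return True


def full_node():
    # A second-tier node already filled by cgroup "A".
    return Node(capacity=2, pages=[Page("A", 10), Page("A", 20)])


# Reclaim constrained to the demoting cgroup: "B" cannot demote at all.
constrained_node = full_node()
constrained_ok = constrained_node.demote(Page("B", 0), constrained=True)

# Global reclaim: the freshly demoted page displaces A's coldest page.
global_node = full_node()
global_ok = global_node.demote(Page("B", 0), constrained=False)
```

With constrained reclaim, cgroup B has nothing of its own to vacate on
the full node, so its demotion fails and A keeps the node until it
leaves voluntarily. With global reclaim, B's freshly demoted page
replaces A's coldest one.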

Consider the case where you have two lower tier nodes and there is
cpuset isolation for the main workloads, but some maintenance thing
runs and pollutes one of the lower tier nodes. Or consider the case
where a shared lower tier node is divvied up between two cgroups using
protection settings to allow overcommit, i.e. per-node memory.low.

Demotions, proactive or not, MUST do global reclaim on a full node.