Re: memcg reclaim demotion wrt. isolation

From: Michal Hocko
Date: Wed Dec 14 2022 - 04:43:17 EST


On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> > Hi,
> > I have just noticed that pages allocated for demotion targets
> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> > pages during reclaim"). I suspect the intention is to trigger the aging
> > on the fallback node and either drop or further demote oldest pages.
> >
> > This makes sense but I suspect that this wasn't intended also for
> > memcg triggered reclaim. This would mean that a memory pressure in one
> > hierarchy could trigger paging out pages of a different hierarchy if the
> > demotion target is close to full.
>
> This is also true if you don't do demotion. If a cgroup tries to
> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> enter global reclaim directly which may push out the memory of other
> cgroups, regardless of the respective cgroup limits.

You are right on this. But this is describing a slightly different
situation IMO.

> The demotion allocations don't strike me as any different. They're
> just allocations on behalf of a cgroup. I would expect them to wake
> kswapd and reclaim physical memory as needed.

I am not sure this is the expected behavior. Consider the currently
discussed memory.demote interface, where userspace can trigger
(almost) arbitrary demotions. This can deplete fallback nodes without
over-committing memory overall, yet push out demoted memory from
other workloads. From the user's POV it would look like reclaim while
overall memory is far from depleted, so it would be considered
premature and would warrant a bug report.

The reclaim behavior would make more sense to me if it was constrained
to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
disrupted.
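
For illustration only, here is a rough userspace sketch of the flag
arithmetic under discussion. The SK_-prefixed names, their bit values
and both helper functions are made up for this example and are not the
kernel definitions. The memcg_reclaim branch is just one hypothetical
way of not waking kswapd for memcg triggered reclaim. It is not a
tested patch, and it is not the same thing as constraining the reclaim
to the allocating hierarchy.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the bit values are assumptions, not the kernel's. */
#define SK_GFP_DIRECT_RECLAIM	0x1u
#define SK_GFP_KSWAPD_RECLAIM	0x2u
#define SK_GFP_RECLAIM		(SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_NOWAIT		SK_GFP_KSWAPD_RECLAIM
/* Stand-in for a movable user allocation mask that allows reclaim. */
#define SK_GFP_MOVABLE_BASE	(0x100u | SK_GFP_RECLAIM)

/* True if an allocation with this mask may wake kswapd on the target node. */
static bool wakes_kswapd(unsigned int gfp)
{
	return (gfp & SK_GFP_KSWAPD_RECLAIM) != 0;
}

/*
 * Build the mask for a demotion target allocation: strip all reclaim
 * bits, then add back the kswapd wakeup through the NOWAIT stand-in.
 * When memcg_reclaim is set, the hypothetical variant drops the kswapd
 * wakeup again so that unrelated lruvecs are not aged.
 */
static unsigned int demotion_gfp(bool memcg_reclaim)
{
	unsigned int gfp = (SK_GFP_MOVABLE_BASE & ~SK_GFP_RECLAIM) | SK_GFP_NOWAIT;

	if (memcg_reclaim)
		gfp &= ~SK_GFP_KSWAPD_RECLAIM;

	/* Demotion allocations must never enter direct reclaim. */
	assert(!(gfp & SK_GFP_DIRECT_RECLAIM));
	return gfp;
}
```

With these definitions wakes_kswapd(demotion_gfp(false)) is true, which
is the behavior described at the top of the thread, while
wakes_kswapd(demotion_gfp(true)) is false.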

--
Michal Hocko
SUSE Labs