Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

From: Feng Tang
Date: Wed Oct 26 2022 - 08:20:22 EST


On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > memory policy, that even if the target demotion node is not allowd
> > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > semantics.
> > > >
> > > > So add cpuset policy check in the demotion path and skip demotion
> > > > if the demotion targets are not allowed by cpuset.
> > > >
> > >
> > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > those memory policy restrictions while demoting the page?
> >
> > Good question! We have some basic patches to consider memory policy
> > in demotion path too, which are still under test, and will be posted
> > soon. And the basic idea is similar to this patch.
>
> For that you need to consult each vma and it's owning task(s) and that
> to me sounds like something to be done in folio_check_references.
> Relying on memcg to get a cpuset cgroup is really ugly and not really
> 100% correct. Memory controller might be disabled and then you do not
> have your association anymore.

You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
and the bright side is most of distribution have it on.

> This all can get quite expensive so the primary question is, does the
> existing behavior generates any real issues or is this more of an
> correctness exercise? I mean it certainly is not great to demote to an
> incompatible numa node but are there any reasonable configurations when
> the demotion target node is explicitly excluded from memory
> policy/cpuset?

We haven't got customer report on this, but there are quite some customers
use cpuset to bind some specific memory nodes to a docker (You've helped
us solve a OOM issue in such cases), so I think it's practical to respect
the cpuset semantics as much as we can.

Your concern about the expensive cost makes sense! Some raw ideas are:
* if the shrink_folio_list is called by kswapd, the folios come from
the same per-memcg lruvec, so only one check is enough
* if not from kswapd, like called form madvise or DAMON code, we can
save a memcg cache, and if the next folio's memcg is same as the
cache, we reuse its result. And due to the locality, the real
check is rarely performed.

Thanks,
Feng

> --
> Michal Hocko
> SUSE Labs
>