Re: [PATCH] mm: deduct the number of pages reclaimed by madvise from workingset

From: Zhaoyang Huang
Date: Fri May 26 2023 - 02:39:16 EST


On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >
> > Pages reclaimed by madvise_pageout are forcibly made inactive and dropped from
> > the LRU, which gives subsequently refaulting pages a larger refault distance
> > than they should have. This can affect the accuracy of thrashing detection when
> > madvise_pageout is used as a common means of memory reclaim, as Android does now.
>
> This alludes to, but doesn't explain, a real world usecase.
We observe more block I/O (wait_on_page_bit_common) during app start on
the latest Android version, where user-space-driven memory reclaim
changed from the in-kernel PPR to madvise_pageout. We believe this could
be related to inaccurate workingset accounting.
>
> Yes, madvise_pageout() will record non-resident entries today. This
> means refault and thrash detection is on for user-driven reclaim.
>
> So why is that undesirable?
Consider an extreme scenario: a page at the tail of the LRU could
accumulate a given refault distance without any in-kernel reclaim
taking place, be wrongly deemed inactive, and get less protection.
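
To make the scenario concrete, here is a toy model in plain C. It is not
kernel code: struct toy_lruvec, toy_evict(), toy_refault_distance() and
toy_should_activate() are hypothetical names, and the single
workingset_size threshold is a simplification of what the real
workingset code compares against. It only illustrates how evictions that
advance the non-resident clock enlarge the refault distance of a page
evicted earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one eviction clock per LRU. */
struct toy_lruvec {
	unsigned long nonresident_age;	/* advances on every eviction */
	unsigned long workingset_size;	/* assumed activation threshold */
};

/*
 * Evict nr pages. Returns the clock value before the eviction, which
 * plays the role of the shadow entry of the first evicted page.
 */
static unsigned long toy_evict(struct toy_lruvec *lv, unsigned long nr)
{
	unsigned long shadow = lv->nonresident_age;

	lv->nonresident_age += nr;
	return shadow;
}

/* Number of evictions since the page behind this shadow was evicted. */
static unsigned long toy_refault_distance(const struct toy_lruvec *lv,
					  unsigned long shadow)
{
	return lv->nonresident_age - shadow;
}

/* A refault is treated as workingset only if the distance is small. */
static bool toy_should_activate(const struct toy_lruvec *lv,
				unsigned long shadow)
{
	return toy_refault_distance(lv, shadow) <= lv->workingset_size;
}
```

With workingset_size set to 500, a page evicted at clock 0 and refaulted
right away has distance 1 and would be activated. If 1000 more pages are
evicted in between, for example by a user-space pageout pass, the same
refault sees distance 1001 and is no longer activated.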
>
> Today we measure and report the cost of reclaim and memory pressure
> for physical memory shortages, cgroup limits, and user-driven cgroup
> reclaim. Why should we not do the same for madv_pageout()? If the
> userspace code that drives pageout has a bug and the result is extreme
> thrashing, wouldn't you want to know that?
Actually, pages evicted by madv_cold/pageout from the active LRU are
not marked as WORKINGSET, so they bypass the thrashing accounting
when they fault back and get stuck on IO. I think they should be
treated in the same way in terms of SetPageWorkingset and
lruvec->non-resident. Please refer to my previous patch "mm: mark
folio as workingset in lru_deactivate_fn index 70e2063..4d1c14f
100644"


>
> Please explain the idea here better.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > ---
> > include/linux/swap.h | 2 +-
> > mm/madvise.c | 4 ++--
> > mm/vmscan.c | 8 +++++++-
> > 3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> > extern int vm_swappiness;
> > long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> > #ifdef CONFIG_NUMA
> > extern int node_reclaim_mode;
> > extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > huge_unlock:
> > spin_unlock(ptl);
> > if (pageout)
> > - reclaim_pages(&page_list);
> > + reclaim_pages(mm, &page_list);
> > return 0;
> > }
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > arch_leave_lazy_mmu_mode();
> > pte_unmap_unlock(orig_pte, ptl);
> > if (pageout)
> > - reclaim_pages(&page_list);
> > + reclaim_pages(mm, &page_list);
> > cond_resched();
> >
> > return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> > return nr_reclaimed;
> > }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> > {
> > int nid;
> > unsigned int nr_reclaimed = 0;
> > LIST_HEAD(node_folio_list);
> > unsigned int noreclaim_flag;
> > + struct lruvec *lruvec;
> > + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> > if (list_empty(folio_list))
> > return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> > }
> >
> > nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > + lruvec = &memcg->nodeinfo[nid]->lruvec;
> > + workingset_age_nonresident(lruvec, -nr_reclaimed);
> > nid = folio_nid(lru_to_folio(folio_list));
> > } while (!list_empty(folio_list));
> >
> > nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > + lruvec = &memcg->nodeinfo[nid]->lruvec;
> > + workingset_age_nonresident(lruvec, -nr_reclaimed);
>
> The task might have moved cgroups in between, who knows what kind of
> artifacts it will introduce if you wind back the wrong clock.
>
> If there are reclaim passes that shouldn't participate in non-resident
> tracking, that should be plumbed through the stack to __remove_mapping
> (which already has that bool reclaimed param to not record entries).
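
For reference, the suggestion above can be sketched as a toy model in
plain C. This is not the kernel's __remove_mapping(): struct toy_clock,
struct toy_shadow and toy_remove_mapping() are hypothetical names. The
only assumption is that the decision to record a non-resident entry and
to age the clock is made in one place, from a boolean passed down by the
caller, so nothing has to be wound back afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: the per-LRU non-resident clock. */
struct toy_clock {
	unsigned long nonresident_age;
};

/* Toy shadow entry left behind when an eviction is recorded. */
struct toy_shadow {
	bool valid;		/* false: no non-resident entry was stored */
	unsigned long eviction;	/* clock value at eviction time */
};

/*
 * Remove nr_pages from the toy page cache. Only when 'reclaimed' is
 * true is a shadow entry recorded and the clock advanced; otherwise
 * the clock is left untouched.
 */
static struct toy_shadow toy_remove_mapping(struct toy_clock *clk,
					    unsigned long nr_pages,
					    bool reclaimed)
{
	struct toy_shadow shadow = { .valid = false, .eviction = 0 };

	if (reclaimed) {
		shadow.valid = true;
		shadow.eviction = clk->nonresident_age;
		clk->nonresident_age += nr_pages;
	}
	return shadow;
}
```

A caller that should not take part in non-resident tracking would pass
reclaimed = false and leave the clock exactly where it was, which avoids
any later subtraction against a clock that may belong to a different
cgroup by then.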