Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap

From: Yu Zhao
Date: Sat Dec 16 2023 - 16:07:51 EST


On Fri, Dec 15, 2023 at 3:53 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote:
>
> On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > >
> > > Thanks for replying to this RFC.
> > >
> > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google
> > > > proposed the idea [1], but we don't really use it anymore because of
> > > > its poor scalability.
> > >
> > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check
> > > whether pages were accessed during a period of time.
> >
> > Is it a production environment? If so, what's your
> > 1. scan interval
> > 2. memory size
> >
> > I'm trying to understand why scalability isn't a problem for you. On
> > an average server, there are hundreds of millions of PFNs, so it'd be
> > very expensive to use that ABI even for a time interval of minutes.
>
> Thanks for replying.
>
> Our scan interval is 10 minutes and total memory size is 512GB.
> We prefer to reclaim pages whose idle age is at least 1 hour.

Yes, that makes sense. We have similar use cases, details below.
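
For a rough sense of scale, here is a back-of-the-envelope sketch in C of
what one full pass over page_idle/bitmap covers on a 512GB machine. It
assumes 4KB base pages and the documented bitmap layout (one bit per PFN,
packed into 8-byte words); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of PFNs covering mem_bytes of RAM, assuming a fixed base page size. */
static uint64_t nr_pfns(uint64_t mem_bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return mem_bytes / page_size;
}

/* Bytes of page_idle/bitmap covering nr PFNs: one bit per PFN, 8-byte words. */
static uint64_t idle_bitmap_bytes(uint64_t nr)
{
	return (nr + 63) / 64 * 8;
}
```

With 512GB and 4KB pages this works out to 134,217,728 PFNs and a 16MB
bitmap per pass, which has to be written and read back every scan interval.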

> > > We manage the idle time of all pages in userspace, then use a prediction
> > > algorithm to select pages to reclaim. These pages are more likely to
> > > stay idle for a long time.
>
> > "There is a system in place now that is based on a user-space process
> > that reads a bitmap stored in sysfs, but it has a high CPU and memory
> > overhead, so a new approach is being tried."
> > https://lwn.net/Articles/787611/
> >
> > Could you elaborate how you solved this problem?
>
> In our environment, we found that scanning, basic analysis, and reclaiming
> idle pages take on average 0.4 cores and 300MB of memory.
>
> To reduce CPU & memory usage, we do the following:
> 1. We implement a ratelimiter to control the rate of scan and reclaim.
> 2. All page info & idle ages are stored in a local DB file. Our prediction
> algorithm doesn't need all page info in memory at the same time.
>
> In our environment, we attempt to allocate about 1/3 of memory as THP,
> which may reduce the CPU cost of scanning.
>
> > > We only need the kernel to tell us whether a page is accessed; a boolean
> > > value in the kernel is enough for our case.
> >
> > How do you define "accessed"? I.e., through page tables or file
> > descriptors or both?
>
> both
>
> > > > 2. PG_idle/young, being a boolean value, has poor granularity. If
> > > > anyone must use page_idle/bitmap for some specific reason, I'd
> > > > recommend exporting generation numbers instead.
> > >
> > > Yes, at first we tried using the multi-gen LRU proactive scan and
> > > exporting generation & refs numbers to do the same thing.
> > >
> > > But there are several problems:
> > >
> > > 1. multi-gen LRU only cares about a memcg's own pages. In our environment,
> > > it's common for processes in different memcgs to share pages.
> >
> > This is related to my question above: are those pages mapped into
> > different memcgs or not?
>
> There is a case:
> There are two cgroups, A and B (B is a child cgroup of A).
> A process in A creates a file and uses mmap to read/write this file.
> A process in B mmaps this file and usually reads it.

Yes, actually we have a private patch to solve a similar problem.
Basically it finds VMAs from other processes in different memcgs that
share a mapping and jumps to those VMAs to scan them. We can upstream
it for you if you find it useful too.
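
To illustrate the idea only (this is not the private patch), here is a toy
userspace sketch in C: given a shared mapping, walk the VMAs that map it and
visit those belonging to other memcgs. The structs and the function name are
hypothetical stand-ins, not kernel types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the kernel's structs. */
struct toy_vma {
	int memcg_id;		/* memcg of the process owning this VMA */
	struct toy_vma *next;	/* next VMA mapping the same file */
};

struct toy_mapping {
	struct toy_vma *vmas;	/* all VMAs that map this file */
};

/*
 * Visit every VMA of @mapping that belongs to a memcg other than
 * @self_memcg and return how many were visited. @scan may be NULL.
 */
static size_t scan_foreign_vmas(const struct toy_mapping *mapping,
				int self_memcg,
				void (*scan)(struct toy_vma *vma))
{
	size_t visited = 0;
	struct toy_vma *vma;

	assert(mapping != NULL);

	for (vma = mapping->vmas; vma; vma = vma->next) {
		if (vma->memcg_id == self_memcg)
			continue;	/* already covered by our own walk */
		if (scan)
			scan(vma);
		visited++;
	}
	return visited;
}
```

In the A/B case quoted above, a scan on behalf of A would also visit B's VMA
for the shared file, so reads from B are no longer missed.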

> > > We still have no idea how to solve this problem.
> > >
> > > 2. We set swappiness to 0, and use proactive scan to select cold pages
> > > & proactive reclaim to swap anon pages. But we can't control passive
> > > scan (can_swap = false), which would cause cold/hot inversion of anon
> > > pages in inc_min_seq.
> >
> > There is an option to prevent the inversion, IIUC, the force_scan
> > option is what you are looking for.
>
> It seems that doesn't work now.
>
> static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
> {
> ......
> 	for (type = ANON_AND_FILE - 1; type >= 0; type--) {
> 		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
> 			continue;
>
> 		VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
>
> 		if (inc_min_seq(lruvec, type, can_swap))
> 			continue;
>
> 		spin_unlock_irq(&lruvec->lru_lock);
> 		cond_resched();
> 		goto restart;
> 	}
> .....
> }
>
> force_scan is not a parameter of inc_min_seq.
> In our environment, swappiness is 0, so can_swap would be false.
>
> static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> {
> 	int zone;
> 	int remaining = MAX_LRU_BATCH;
> 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
> 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
> 	if (type == LRU_GEN_ANON && !can_swap)
> 		goto done;
> ......
> }
>
> If can_swap is false, inc_min_seq() skips the anon LRU list.
>
> What's more, in passive scan, force_scan is also false.

Ok, I see what you mean. (I thought "passive" means proactive scans
triggered by the debugfs interface, but it actually means "reactive"
scans triggered by memory pressure.) We actually have a private patch
too to solve this. But there is a corner case here: that private
change, which is essentially the same as what you suggested, can stall
direct reclaim when there is tons of cold anon memory. E.g., if there
is 300GB anon memory in the oldest generation which can't be swapped,
calling inc_min_seq() with can_swap being true would stall the direct
reclaim. Does it make sense?
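
To put a rough number on that stall, here is a back-of-the-envelope sketch
in C. It assumes 4KB folios and that each inc_min_seq() call moves at most
MAX_LRU_BATCH folios before inc_max_seq() drops the lock and restarts, as in
the code quoted above. The batch value of 4096 (64 * BITS_PER_LONG on
64-bit) and the helper name are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: MAX_LRU_BATCH is 64 * BITS_PER_LONG, i.e. 4096 on 64-bit. */
#define TOY_MAX_LRU_BATCH 4096ULL

/*
 * Number of batch-limited passes needed to move nr_folios out of the
 * oldest generation, rounding up for a partial last batch.
 */
static uint64_t passes_to_drain(uint64_t nr_folios, uint64_t batch)
{
	assert(batch > 0);
	return (nr_folios + batch - 1) / batch;
}
```

Under those assumptions, 300GB is 78,643,200 folios, or 19,200 passes, each
retaking lru_lock, before the direct reclaimer can move on.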

Let me check the state of those private patches and get back to you in
a couple of days. Thanks!