Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap

From: Henry Huang
Date: Fri Dec 15 2023 - 06:09:48 EST


On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > Thanks for replying this RFC.
> >
> > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google
> > > proposed the idea [1], but we don't really use it anymore because of
> > > its poor scalability.
> >
> > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check
> > whether pages were accessed during a period of time.
>
> Is it a production environment? If so, what's your
> 1. scan interval
> 2. memory size

> I'm trying to understand why scalability isn't a problem for you. On
> an average server, there are hundreds of millions of PFNs, so it'd be
> very expensive to use that ABI even for a time interval of minutes.

Thanks for replying.

Our scan interval is 10 minutes and total memory size is 512GB.
We prefer to reclaim pages whose idle age is at least 1 hour.
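
To put these numbers in perspective, here is a rough back-of-the-envelope
sketch. It assumes 4 KiB base pages (the page size is not stated above) and
the documented page_idle/bitmap layout of one bit per PFN packed into 8-byte
words. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages; the page size is not given in this thread. */
#define PAGE_SIZE_BYTES 4096ULL
/* page_idle/bitmap packs 64 PFNs into each 8-byte word. */
#define PFNS_PER_WORD 64ULL

/* Number of base-page PFNs covering mem_bytes of memory. */
static uint64_t nr_pfns(uint64_t mem_bytes)
{
	return mem_bytes / PAGE_SIZE_BYTES;
}

/* Bytes of bitmap one full scan has to read (rounded up to whole words). */
static uint64_t bitmap_bytes(uint64_t pfns)
{
	return (pfns + PFNS_PER_WORD - 1) / PFNS_PER_WORD * 8;
}

/* Consecutive idle scans needed before a page reaches the age threshold. */
static unsigned int scans_for_age(unsigned int age_min, unsigned int interval_min)
{
	assert(interval_min > 0);
	return (age_min + interval_min - 1) / interval_min;
}

/* Byte offset in the bitmap file of the 8-byte word that holds pfn. */
static uint64_t bitmap_offset(uint64_t pfn)
{
	return pfn / PFNS_PER_WORD * 8;
}
```

Reading 512GB as 512 GiB, this gives 134,217,728 PFNs, a 16 MiB bitmap per
full scan, and 6 consecutive idle scans before a page crosses the 1 hour
threshold.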

> > We manage the idle time of all pages
> > in userspace, then use a prediction algorithm to select pages
> > to reclaim. These pages are more likely to stay idle for a long time.

> "There is a system in place now that is based on a user-space process
> that reads a bitmap stored in sysfs, but it has a high CPU and memory
> overhead, so a new approach is being tried."
> https://lwn.net/Articles/787611/
>
> Could you elaborate how you solved this problem?

In our environment, we found that scanning, basic analysis and reclaiming
idle pages take on average 0.4 core and 300MB of memory.

To reduce CPU and memory usage, we do the following:
1. We implement a ratelimiter to control the rate of scan and reclaim.
2. All page info and idle ages are stored in a local DB file. Our prediction
algorithm doesn't need all page info in memory at the same time.
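
The two points above could be sketched roughly as follows. This is only an
illustration under assumptions, not the tool described here: idle age is
modelled as a saturating count of consecutive idle scans, the ratelimiter as
a plain token bucket, and all names (update_idle_ages, token_bucket,
bucket_take) are hypothetical. The DB storage is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AGE_MAX UINT8_MAX

/*
 * Fold one 64-bit bitmap word into per-page idle ages.
 * Bit i set means page i of this word was still idle at this scan.
 * ages[i] counts consecutive idle scans and saturates at AGE_MAX.
 * Returns how many pages reached the threshold with this scan.
 */
static size_t update_idle_ages(uint8_t ages[64], uint64_t idle_word,
			       uint8_t threshold)
{
	size_t newly_cold = 0;

	assert(threshold > 0);
	for (int i = 0; i < 64; i++) {
		if (!(idle_word & (1ULL << i))) {
			ages[i] = 0;	/* accessed: restart the count */
			continue;
		}
		if (ages[i] < AGE_MAX && ++ages[i] == threshold)
			newly_cold++;
	}
	return newly_cold;
}

/* Hypothetical token bucket: one token per page scanned or reclaimed. */
struct token_bucket {
	uint64_t tokens;	/* pages we may process right now */
	uint64_t burst;		/* cap on accumulated tokens */
	uint64_t rate;		/* tokens added per second */
	uint64_t last_sec;	/* time of the last refill */
};

/* Refill for the elapsed time, then try to take n tokens; 1 on success. */
static int bucket_take(struct token_bucket *tb, uint64_t now_sec, uint64_t n)
{
	uint64_t total;

	assert(now_sec >= tb->last_sec);
	total = tb->tokens + (now_sec - tb->last_sec) * tb->rate;
	tb->last_sec = now_sec;
	tb->tokens = total > tb->burst ? tb->burst : total;
	if (tb->tokens < n)
		return 0;
	tb->tokens -= n;
	return 1;
}
```

A scanner built this way would call bucket_take() with 64 before processing
each bitmap word and sleep when it returns 0, so only one word's worth of
ages needs to be in memory at a time.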

In our environment, we attempt to allocate about 1/3 of memory as THP,
which may save some CPU usage during the scan.

> > We only need the kernel to tell us whether a page is accessed, a boolean
> > value in kernel is enough for our case.
>
> How do you define "accessed"? I.e., through page tables or file
> descriptors or both?

Both.

> > > 2. PG_idle/young, being a boolean value, has poor granularity. If
> > > anyone must use page_idle/bitmap for some specific reason, I'd
> > > recommend exporting generation numbers instead.
> >
> > Yes, at first, we tried using multi-gen LRU proactive scan and
> > exporting generation & refs numbers to do the same thing.
> >
> > But there are several problems:
> >
> > 1. multi-gen LRU only cares about a memcg's own pages. In our environment,
> > it's common for processes in different memcgs to share pages.
>
> This is related to my question above: are those pages mapped into
> different memcgs or not?

Here is one case:
There are two cgroups, A and B (B is a child cgroup of A).
A process in A creates a file and uses mmap to read/write this file.
A process in B mmaps the same file and usually reads it.

> > We still have no ideas how to solve this problem.
> >
> > 2. We set swappiness 0, and use proactive scan to select cold pages
> > & proactive reclaim to swap anon pages. But we can't control passive
> > scan(can_swap = false), which would make anon pages cold/hot inversion
> > in inc_min_seq.
>
> There is an option to prevent the inversion, IIUC, the force_scan
> option is what you are looking for.

It seems that this doesn't work at the moment.

static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
{
	......
	for (type = ANON_AND_FILE - 1; type >= 0; type--) {
		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
			continue;

		VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));

		if (inc_min_seq(lruvec, type, can_swap))
			continue;

		spin_unlock_irq(&lruvec->lru_lock);
		cond_resched();
		goto restart;
	}
	.....
}

force_scan is not a parameter of inc_min_seq.
In our environment, swappiness is 0, so can_swap would be false.

static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
{
	int zone;
	int remaining = MAX_LRU_BATCH;
	struct lru_gen_folio *lrugen = &lruvec->lrugen;
	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);

	if (type == LRU_GEN_ANON && !can_swap)
		goto done;
	......
}

If can_swap is false, inc_min_seq() jumps to done and skips the anon LRU list.

What's more, in a passive scan, force_scan is also false.

static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
{
	......
	/* skip this lruvec as it's low on cold folios */
	return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}

Would it be a good idea to add a global parameter no_inversion and modify
inc_min_seq like this:

 static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 {
	int zone;
	int remaining = MAX_LRU_BATCH;
	struct lru_gen_folio *lrugen = &lruvec->lrugen;
	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);

-	if (type == LRU_GEN_ANON && !can_swap)
+	if (type == LRU_GEN_ANON && !can_swap && !no_inversion)
		goto done;
	......
 }
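
To make the intended effect concrete, the two conditions can be compared in
a small userspace model. This is a toy sketch, not kernel code: it only
mirrors the early-exit check, the function names are hypothetical, and
no_inversion is the proposed (not existing) parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's LRU type indices, for readability only. */
enum { LRU_GEN_ANON, LRU_GEN_FILE };

/* Current check: true means inc_min_seq() jumps to done without a walk. */
static bool skips_walk_current(int type, bool can_swap)
{
	return type == LRU_GEN_ANON && !can_swap;
}

/* Proposed check: no_inversion forces the walk even when !can_swap. */
static bool skips_walk_proposed(int type, bool can_swap, bool no_inversion)
{
	return type == LRU_GEN_ANON && !can_swap && !no_inversion;
}

/* With no_inversion off, the proposed check must match the current one. */
static void check_default_unchanged(void)
{
	for (int type = LRU_GEN_ANON; type <= LRU_GEN_FILE; type++)
		for (int swap = 0; swap <= 1; swap++)
			assert(skips_walk_proposed(type, swap, false) ==
			       skips_walk_current(type, swap));
}
```

The only input for which the two differ is type == LRU_GEN_ANON with
can_swap == false and no_inversion == true, which is our swappiness 0 case.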

--
2.43.0