Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap

From: Yu Zhao
Date: Wed Dec 06 2023 - 20:31:04 EST


On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote:
>
> Multi-Gen LRU page-table walker clears pte young flag, but it doesn't
> clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check
> whether one page is accessed, it would tell us this page is idle,
> but actually this page has been accessed.
>
> For those unmapped filecache pages, page idle flag would not been
> cleared in folio_mark_accessed if Multi-Gen LRU is enabled.
> So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether
> a filecache page is read or written.
>
> What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag.
> If one page is accessed, it would set page young flag. Multi-Gen LRU
> page-table walker should check both page&pte young flags.
>
> how-to-reproduce-problem
>
> idle_page_track
> a tools to track process accessed memory during a specific time
> usage
> idle_page_track $pid $time
> how-it-works
> 1. scan process vma from /proc/$pid/maps
> 2. vfn --> pfn from /proc/$pid/pagemap
> 3. write /sys/kernel/mm/page_idle/bitmap to
> mark phy page idle flag and clear pte young flag
> 4. sleep $time
> 5. read /sys/kernel/mm/page_idle/bitmap to
> test_and_clear pte young flag and
> return whether phy page is accessed
>
> test ---- test program
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, const char *argv[])
> {
> char *buf = NULL;
> char pipe_info[4096];
> int n;
> int fd = -1;
>
> buf = malloc(1024*1024*1024UL);
> memset(buf, 0, 1024*1024*1024UL);
> fd = open("access.pipe", O_RDONLY);
> if (fd < 0)
> goto out;
> while (1) {
> n = read(fd, pipe_info, sizeof(pipe_info));
> if (!n) {
> sleep(1);
> continue;
> } else if (n < 0) {
> break;
> }
> memset(buf, 0, 1024*1024*1024UL);
> puts("finish access");
> }
> out:
> if (fd >=0)
> close(fd);
> if (buf)
> free(buf);
>
> return 0;
> }
>
> prepare:
> mkfifo access.pipe
> ./test
> ps -ef | grep test
> root 4106 3148 8 06:47 pts/0 00:00:01 ./test
>
> We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan.
>
> case 1: mglru walker break page_idle
> ./idle_page_track 4106 60 &
> sleep 5; echo 1 > access.pipe
> sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen
>
> the output of idle_page_track is:
> Est(s) Ref(MB)
> 64.822 1.00
> only found 1MB were accessed during 64.822s, but actually 1024MB were
> accessed.
>
> case 2: page_idle break mglru walker
> echo 1 > access.pipe
> ./idle_page_track 4106 10
> echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen
> lru gen status:
> memcg 8 /user.slice
> node 0
> 5 772458 1065 9735
> 6 737435 262244 72
> 7 538053 1184 632
> 8 59404 6422 0
> almost pages should be in max_seq-1 queue, but actually not.
>
> Signed-off-by: Henry Huang <henry.hj@xxxxxxxxxxxx>

It's never intended for MGLRU to support page_idle/bitmap or
PG_idle/young because:
1. page_idle/bitmap isn't a capable interface at all -- yes, Google
proposed the idea [1], but we don't really use it anymore because of
its poor scalability.
2. PG_idle/young, being a boolean value, has poor granularity. If
anyone must use page_idle/bitmap for some specific reason, I'd
recommend exporting generation numbers instead.

[1] https://lore.kernel.org/cover.1426706637.git.vdavydov@xxxxxxxxxxxxx/