Re: [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
From: James Houghton
Date: Mon Feb 03 2025 - 17:46:58 EST
On Fri, Jan 10, 2025 at 4:12 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> This can/should be posted separately, no? MGLRU support for secondary MMUs went
> through mm/, the KVM changes in this series are all about making KVM x86 faster.
I'll send this patch separately, sure.
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > This test now has two modes of operation:
>
> Bad Google3, bad! Write changelogs as commands, i.e. state what the patch is
> doing. The above is also misleading (at least, it was to me), as I assumed one
> of the modes would be "legacy" and the other would be MGLRU. But it looks like
> this patch adds MGLRU support *and* benchmarking.
>
> This should be split into at least two patches, possibly three (I can't tell how
> much pre-work there is). I.e. add MGLRU support, and then add the benchmarking
> stuff. And if there's substantial refactoring and/or new utilities, do that first.
>
> > 1. (default) To check how much vCPU performance was affected by access
> > tracking (previously existed, now supports MGLRU aging).
> > 2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
> > faulting in memory.
> >
> > Mode (1) also serves as a way to verify that aging is working properly
> > for pages only accessed by KVM. It will fail if one does not have the
>
> "one" what?
>
> > 0x8 lru_gen feature bit.
> >
> > To support MGLRU, the test creates a memory cgroup, moves itself into
> > it, then uses the lru_gen debugfs output to track memory in that cgroup.
> > The logic to parse the lru_gen debugfs output has been put into
> > selftests/kvm/lib/lru_gen_util.c.
> >
> > Co-developed-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > Signed-off-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
>
> ...
>
> > @@ -47,6 +48,19 @@
> > #include "memstress.h"
> > #include "guest_modes.h"
> > #include "processor.h"
> > +#include "lru_gen_util.h"
> > +
> > +static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
> > +static const int LRU_GEN_ENABLED = 0x1;
> > +static const int LRU_GEN_MM_WALK = 0x2;
>
> Is there really no uAPI #define for these?
>
> > +static const char *CGROUP_PROCS = "cgroup.procs";
> > +/*
> > + * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
>
> I would say "requires", not "assumes".
>
> > + * is mounted at cgroup_root.
> > + *
> > + * Can be changed with -r.
>
> This is amusingly vague. I vote to omit explicitly referencing the command line
> option, we have enough problems maintaining them as it is. Instead simply say
> something like "Default to the kernel's preferred path for mounting cgroups" to
> both explain where the default comes from, and to give the reader a hint that the
> path can be changed.
>
> Actually, there's zero reason for this to be global. More below.
>
> Ugh, and if this test is manipulating cgroups, won't it need root? I doubt
> y'all have tried to run this outside of devrez, i.e. on a system where you're
> not logged in as root.
>
> Oh, never mind, this test already effectively requires root.
>
> > + /* Whether to use lru_gen aging instead of idle page tracking. */
> > + bool lru_gen;
>
> Needs a verb, otherwise this looks like it tracks the LRU generation. E.g. use_lru_gen.
>
> > +
> > + /* Whether to test the performance of aging itself. */
> > + bool benchmark_lru_gen;
> > };
> >
> > static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> > @@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> >
> > }
> >
> > +static void write_file_long(const char *path, long v)
> > +{
> > + FILE *f;
> > +
> > + f = fopen(path, "w");
> > + TEST_ASSERT(f, "fopen(%s) failed", path);
> > + TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
> > + "fprintf to %s failed", path);
> > + TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
> > +}
> > +
> > +static char *path_join(const char *parent, const char *child)
> > +{
> > + char *out = NULL;
> > +
> > + return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
> > +}
>
> These are common utilities, no? I.e. should be somewhere common, not buried in
> this test.
>
> > +
> > +static char *memcg_path(const char *memcg)
> > +{
> > + return path_join(cgroup_root, memcg);
>
> Eh, do the join when cgroup_root is first defined. Actually, looking at the
> usage more closely, the cgroup path is only used during main(). Just do all of
> the joins there, I see no reason to have these one-off helpers.
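>
> E.g. a rough sketch (untested, variable name made up):
>
>	char *procs_path;
>
>	TEST_ASSERT(asprintf(&procs_path, "%s/%s/cgroup.procs",
>			     cgroup_root, TEST_MEMCG_NAME) >= 0,
>		    "Failed to construct cgroup.procs path");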
>
> > +}
> > +
> > +static char *memcg_file_path(const char *memcg, const char *file)
> > +{
> > + char *mp = memcg_path(memcg);
> > + char *fp;
> > +
> > + if (!mp)
> > + return NULL;
>
> Returning NULL just so that the one user can assert on !NULL is rather pointless.
>
> > + fp = path_join(mp, file);
> > + free(mp);
> > + return fp;
> > +}
> > +
> > +static void move_to_memcg(const char *memcg, pid_t pid)
> > +{
> > + char *procs = memcg_file_path(memcg, CGROUP_PROCS);
> > +
> > + TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
> > + write_file_long(procs, pid);
> > + free(procs);
> > +}
> > +
> > #define PAGEMAP_PRESENT (1ULL << 63)
> > #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
> >
> > @@ -242,6 +309,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
> > };
> >
> > vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
> > + clock_gettime(CLOCK_MONOTONIC,
> > + &vcpu_last_completed_time[vcpu_idx]);
> > }
> > }
> >
> > @@ -253,38 +322,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
> > }
> > }
> >
> > +static bool all_vcpus_done(int target_iteration, int nr_vcpus)
> > +{
> > + for (int i = 0; i < nr_vcpus; ++i)
>
> Preferred style is to declare variables outside of loops.
>
> Needs curly braces.
>
> > + if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
> > + target_iteration)
>
> Don't wrap.
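>
> E.g. all together, something like (untested):
>
>	static bool all_vcpus_done(int target_iteration, int nr_vcpus)
>	{
>		int i;
>
>		for (i = 0; i < nr_vcpus; i++) {
>			if (READ_ONCE(vcpu_last_completed_iteration[i]) != target_iteration)
>				return false;
>		}
>
>		return true;
>	}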
>
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > /* The type of memory accesses to perform in the VM. */
> > enum access_type {
> > ACCESS_READ,
> > ACCESS_WRITE,
> > };
> >
> > -static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
> > +static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
> > + bool wait)
> > {
> > - struct timespec ts_start;
> > - struct timespec ts_elapsed;
> > int next_iteration, i;
> >
> > /* Kick off the vCPUs by incrementing iteration. */
> > next_iteration = ++iteration;
> >
> > - clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > -
> > /* Wait for all vCPUs to finish the iteration. */
> > - for (i = 0; i < nr_vcpus; i++)
> > - spin_wait_for_vcpu(i, next_iteration);
> > + if (wait) {
> > + struct timespec ts_start;
> > + struct timespec ts_elapsed;
> > +
> > + clock_gettime(CLOCK_MONOTONIC, &ts_start);
> >
> > - ts_elapsed = timespec_elapsed(ts_start);
> > - pr_info("%-30s: %ld.%09lds\n",
> > - description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > + for (i = 0; i < nr_vcpus; i++)
> > + spin_wait_for_vcpu(i, next_iteration);
> > +
> > + ts_elapsed = timespec_elapsed(ts_start);
> > +
> > + pr_info("%-30s: %ld.%09lds\n",
> > + description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > + } else
>
> Needs curly braces.
>
> > + pr_info("%-30s\n", description);
> > }
> >
> > -static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> > - enum access_type access, const char *description)
> > +static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
>
> Use double underscores (ignore any precedent in selftests that uses just one,
> we're trying to purge that ugliness).
>
> > + enum access_type access, const char *description,
> > + bool wait)
> > {
> > memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
> > iteration_work = ITERATION_ACCESS_MEMORY;
> > - run_iteration(vm, nr_vcpus, description);
> > + run_iteration(vm, nr_vcpus, description, wait);
> > +}
> > +
> > +static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> > + enum access_type access, const char *description)
> > +{
> > + return _access_memory(vm, nr_vcpus, access, description, true);
> > +}
> > +
> > +static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
>
> Maybe "nowait" instead of "async"? Yeah, something ends up being asynchronous
> (presumably), but the call itself is "synchronous", i.e. isn't spun off to a
> worker or anything.
>
> > + enum access_type access,
> > + const char *description)
> > +{
> > + return _access_memory(vm, nr_vcpus, access, description, false);
> > }
> >
> > static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
> > @@ -297,19 +396,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
> > */
> > pr_debug("Marking VM memory idle (slow)...\n");
> > iteration_work = ITERATION_MARK_IDLE;
> > - run_iteration(vm, nr_vcpus, "Mark memory idle");
> > + run_iteration(vm, nr_vcpus, "Mark memory idle", true);
> > }
> >
> > -static void run_test(enum vm_guest_mode mode, void *arg)
> > +static void create_memcg(const char *memcg)
> > +{
> > + const char *full_memcg_path = memcg_path(memcg);
> > + int ret;
> > +
> > + TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
> > +retry:
> > + ret = mkdir(full_memcg_path, 0755);
> > + if (ret && errno == EEXIST) {
> > + TEST_ASSERT(!rmdir(full_memcg_path),
> > + "Found existing memcg at %s, but rmdir failed",
> > + full_memcg_path);
> > + goto retry;
>
>	while (1) {
>		ret = mkdir(full_memcg_path, 0755);
>		if (!ret)
>			break;
>
>		TEST_ASSERT(errno == EEXIST,
>			    "Creating the memcg via 'mkdir(%s)' failed",
>			    full_memcg_path);
>
>		TEST_ASSERT(!rmdir(full_memcg_path),
>			    "Found existing memcg at %s, but rmdir failed",
>			    full_memcg_path);
>	}
>
> > + }
> > + TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
>
> Heh, so it failed?
>
> > + full_memcg_path);
> > +
> > + pr_info("Created memcg at %s\n", full_memcg_path);
> > +}
>
> ...
>
> > +/*
> > + * Test how much access tracking affects vCPU performance.
> > + *
> > + * Supports two modes of access tracking:
> > + * - idle page tracking
> > + * - lru_gen aging
> > + *
> > + * When using lru_gen, this test additionally verifies that the pages are in
> > + * fact getting younger and older, otherwise the performance data would be
> > + * invalid.
> > + *
> > + * The forced lru_gen aging can race with aging that occurs naturally.
> > + */
> > +static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
> > + struct test_params *params)
> > +{
> > + int nr_vcpus = params->nr_vcpus;
> > + bool lru_gen = params->lru_gen;
> > + struct memcg_stats stats;
> > + // If guest_page_size is larger than the host's page size, the
> > + // guest (memstress) will only fault in a subset of the host's pages.
>
> No C++ comments, please.
>
> > + long total_pages = nr_vcpus * params->vcpu_memory_bytes /
> > + max(memstress_args.guest_page_size,
> > + (uint64_t)getpagesize());
>
> max_t() is probably better.
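>
> E.g. (assuming max_t() is available via the selftests headers):
>
>	long total_pages = nr_vcpus * params->vcpu_memory_bytes /
>			   max_t(uint64_t, memstress_args.guest_page_size,
>				 getpagesize());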
>
> > + int found_gens[5];
> >
> > pr_info("\n");
> > access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
> > @@ -319,11 +514,78 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> > access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
> >
> > /* Repeat on memory that has been marked as idle. */
> > - mark_memory_idle(vm, nr_vcpus);
> > + if (lru_gen) {
> > + /* Do an initial page table scan */
> > + lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > + TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
> > + "Not all pages tracked in lru_gen stats.\n"
> > + "Is lru_gen enabled? Did the memcg get created properly?");
>
> Align indentation.
>
>	TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
>		    "Not all pages tracked in lru_gen stats.\n"
>		    "Is lru_gen enabled? Did the memcg get created properly?");
>
> > +
> > + /* Find the generation we're currently in (probably youngest) */
> > + found_gens[0] = lru_gen_find_generation(&stats, total_pages);
> > +
> > + /* Do an aging pass now */
> > + lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > + /* Same generation, but a newer generation has been made */
> > + found_gens[1] = lru_gen_find_generation(&stats, total_pages);
> > + TEST_ASSERT(found_gens[1] == found_gens[0],
> > + "unexpected gen change: %d vs. %d",
> > + found_gens[1], found_gens[0]);
>
> I don't have any ideas off the top of my head, but there's gotta be a way to
> dedup these blocks.
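>
> Maybe a small helper, e.g. (rough sketch, name made up):
>
>	static int do_aging_and_find_gen(struct memcg_stats *stats, long total_pages)
>	{
>		lru_gen_do_aging(stats, TEST_MEMCG_NAME);
>		return lru_gen_find_generation(stats, total_pages);
>	}
>
> so each block becomes one call plus the relevant TEST_ASSERT().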
>
> > + } else
>
> Needs curly braces.
>
> > + mark_memory_idle(vm, nr_vcpus);
> > +
> > access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
> > - mark_memory_idle(vm, nr_vcpus);
> > +
> > + if (lru_gen) {
> > + /* Scan the page tables again */
> > + lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > + /* The pages should now be young again, so in a newer generation */
> > + found_gens[2] = lru_gen_find_generation(&stats, total_pages);
> > + TEST_ASSERT(found_gens[2] > found_gens[1],
> > + "pages did not get younger");
> > +
> > + /* Do another aging pass */
> > + lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > + /* Same generation; new generation has been made */
> > + found_gens[3] = lru_gen_find_generation(&stats, total_pages);
> > + TEST_ASSERT(found_gens[3] == found_gens[2],
> > + "unexpected gen change: %d vs. %d",
> > + found_gens[3], found_gens[2]);
> > + } else
>
> Once more, with feeling!
>
> > + mark_memory_idle(vm, nr_vcpus);
> > +
> > access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
> >
> > + if (lru_gen) {
> > + /* Scan the pages tables again */
> > + lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > + /* The pages should now be young again, so in a newer generation */
> > + found_gens[4] = lru_gen_find_generation(&stats, total_pages);
> > + TEST_ASSERT(found_gens[4] > found_gens[3],
> > + "pages did not get younger");
> > + }
> > +}
>
> ...
>
> > @@ -353,13 +618,15 @@ int main(int argc, char *argv[])
> > .backing_src = DEFAULT_VM_MEM_SRC,
> > .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> > .nr_vcpus = 1,
> > + .lru_gen = false,
> > + .benchmark_lru_gen = false,
> > };
> > int page_idle_fd;
> > int opt;
> >
> > guest_modes_append_default();
> >
> > - while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> > + while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
> > switch (opt) {
> > case 'm':
> > guest_modes_cmdline(optarg);
> > @@ -376,6 +643,15 @@ int main(int argc, char *argv[])
> > case 's':
> > params.backing_src = parse_backing_src_type(optarg);
> > break;
> > + case 'l':
> > + params.lru_gen = true;
> > + break;
> > + case 'p':
> > + params.benchmark_lru_gen = true;
> > + break;
> > + case 'r':
> > + cgroup_root = strdup(optarg);
> > + break;
> > case 'h':
> > default:
> > help(argv[0]);
> > @@ -383,12 +659,42 @@ int main(int argc, char *argv[])
> > }
> > }
> >
> > - page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > - __TEST_REQUIRE(page_idle_fd >= 0,
> > - "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> > - close(page_idle_fd);
> > + if (!params.lru_gen) {
> > + page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > + __TEST_REQUIRE(page_idle_fd >= 0,
> > + "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> > + close(page_idle_fd);
> > + } else {
> > + int lru_gen_fd, lru_gen_debug_fd;
> > + long mglru_features;
> > + char mglru_feature_str[8] = {};
> > +
> > + lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
> > + __TEST_REQUIRE(lru_gen_fd >= 0,
> > + "CONFIG_LRU_GEN is not enabled");
>
> Noooo! Do not assume opening a path failed because of some config option. That
> reminds me, this needs to be rewritten. I can't count the number of times this
> stupid test has oh so helpfully told me CONFIG_IDLE_PAGE_TRACKING is disabled,
> when in fact the problem is that I didn't run it as root.
>
>	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
>	__TEST_REQUIRE(page_idle_fd >= 0,
>		       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
>
>
> By all means, give the user a hint, but don't assume anything. E.g.
>
>	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
>	__TEST_REQUIRE(page_idle_fd >= 0,
>		       "Failed to open blah blah blah, is CONFIG_IDLE_PAGE_TRACKING enabled?");
>
> > + TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
> > + "couldn't read lru_gen features");
> > + mglru_features = strtol(mglru_feature_str, NULL, 16);
> > + __TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
> > + "lru_gen is not enabled");
> > + __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
> > + "lru_gen does not support MM_WALK");
> > +
> > + lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
> > + __TEST_REQUIRE(lru_gen_debug_fd >= 0,
> > + "Cannot access %s", DEBUGFS_LRU_GEN);
> > + close(lru_gen_debug_fd);
> > + }
> > +
> > + TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
> > + "-p specified without -l");
> > +
> > + if (params.lru_gen) {
> > + create_memcg(TEST_MEMCG_NAME);
> > + move_to_memcg(TEST_MEMCG_NAME, getpid());
>
> After this, cgroup_root is never used. Hint, hint.
>
> > + }
> >
> > - for_each_guest_mode(run_test, ¶ms);
> > + for_each_guest_mode(setup_vm_and_run, ¶ms);
> >
> > return 0;
> > }
> > diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
> > new file mode 100644
> > index 000000000000..4eef8085a3cb
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
> > @@ -0,0 +1,55 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
> > + *
> > + * Copyright (C) 2024, Google LLC.
> > + */
> > +#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
> > +#define SELFTEST_KVM_LRU_GEN_UTIL_H
> > +
> > +#include <inttypes.h>
> > +#include <limits.h>
> > +#include <stdlib.h>
> > +
> > +#include "test_util.h"
> > +
> > +#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
> > +#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
>
> Who is "we"? KVM selftests? Kernel?
> > +
> > +static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
>
> So, root again?
>
> > +
> > +struct generation_stats {
> > + int gen;
> > + long age_ms;
> > + long nr_anon;
> > + long nr_file;
> > +};
> > +
> > +struct node_stats {
> > + int node;
> > + int nr_gens; /* Number of populated gens entries. */
> > + struct generation_stats gens[MAX_NR_GENS];
> > +};
> > +
> > +struct memcg_stats {
> > + unsigned long memcg_id;
> > + int nr_nodes; /* Number of populated nodes entries. */
> > + struct node_stats nodes[MAX_NR_NODES];
> > +};
> > +
> > +void print_memcg_stats(const struct memcg_stats *stats, const char *name);
> > +
> > +void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
> > +
> > +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
>
> These need lru_gen_ namespacing. As is, they are very, very misleading names.
> They don't read core memcg stuff, they read LRU_GEN stuff, which AFAICT happens
> to be indexed by the memcg name.
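>
> E.g.:
>
>	void lru_gen_read_memcg_stats(struct memcg_stats *stats, const char *memcg);
>	void lru_gen_read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);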
>
> > +static void memcg_stats_handle_in_node(struct memcg_stats *stats,
> > + struct memcg_stats_parse_context *ctx,
> > + char *line)
> > +{
> > + /* Have to copy since we might not consume */
>
> Huh?
>
> > + char *my_line = strdup(line);
> > + struct split_iterator it = { .str = my_line };
> > + char *gen, *age, *nr_anon, *nr_file;
> > + struct node_stats *node_stats;
> > + struct generation_stats *gen_stats;
> > + char *end;
> > +
> > + TEST_ASSERT(it.str, "failed to copy input line");
> > +
> > + gen = split_next(&it);
> > +
> > + /* Skip empty lines */
> > + if (!gen)
> > + goto out_consume; /* Skip empty lines */
>
> If you say it three times, it might happen! (Can you tell it's Friday afternoon?)
>
> > + if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
> > + /*
> > + * Reached next memcg or node section. Don't consume, let the
> > + * other handler deal with this.
> > + */
> > + ctx->next_handler = memcg_stats_handle_in_memcg;
> > + goto out;
> > + }
>
> ...
>
> > +void print_memcg_stats(const struct memcg_stats *stats, const char *name)
> > +{
> > + int node, gen;
> > +
> > + fprintf(stderr, "stats for memcg %s (id %lu):\n",
>
> Why stderr? This is effectively wrapped with DEBUG, so why not pr_debug()?
>
> > + name, stats->memcg_id);
> > + for (node = 0; node < stats->nr_nodes; ++node) {
> > + fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
> > + for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
> > + const struct generation_stats *gstats =
> > + &stats->nodes[node].gens[gen];
> > +
> > + fprintf(stderr,
> > + "\t\tgen %d\tage_ms %ld"
> > + "\tnr_anon %ld\tnr_file %ld\n",
> > + gstats->gen, gstats->age_ms, gstats->nr_anon,
> > + gstats->nr_file);
> > + }
> > + }
> > +}
> > +
> > +/* Re-read lru_gen debugfs information for @memcg into @stats. */
> > +void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
> > +{
> > + FILE *f;
> > + ssize_t read = 0;
> > + char *line = NULL;
> > + size_t bufsz;
> > + struct memcg_stats_parse_context ctx = {
> > + .next_handler = memcg_stats_handle_searching,
> > + .name = memcg,
> > + };
> > +
> > + memset(stats, 0, sizeof(struct memcg_stats));
> > +
> > + f = fopen(DEBUGFS_LRU_GEN, "r");
> > + TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> > +
> > + while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
> > + ctx.consumed = false;
> > +
> > + do {
> > + ctx.next_handler(stats, &ctx, line);
> > + if (!ctx.next_handler)
> > + break;
> > + } while (!ctx.consumed);
> > + }
> > +
> > + if (read < 0 && !feof(f))
> > + TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
> > +
> > + TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
> > + "Did the memcg get created in the proper mount?",
> > + memcg);
> > + if (line)
> > + free(line);
> > + TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> > +}
> > +
> > +/*
> > + * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
> > + *
> > + * If @target_gen is negative, look for all generations.
> > + */
> > +static long sum_memcg_stats_for_gen(int target_gen,
> > + const struct memcg_stats *stats)
> > +{
> > + int node, gen;
> > + long total_nr = 0;
> > +
> > + for (node = 0; node < stats->nr_nodes; ++node) {
> > + const struct node_stats *node_stats = &stats->nodes[node];
> > +
> > + for (gen = 0; gen < node_stats->nr_gens; ++gen) {
> > + const struct generation_stats *gen_stats =
> > + &node_stats->gens[gen];
> > +
> > + if (target_gen >= 0 && gen_stats->gen != target_gen)
> > + continue;
> > +
> > + total_nr += gen_stats->nr_anon + gen_stats->nr_file;
> > + }
> > + }
> > +
> > + return total_nr;
> > +}
> > +
> > +/* Find all pages tracked by lru_gen for this memcg. */
> > +long sum_memcg_stats(const struct memcg_stats *stats)
> > +{
> > + return sum_memcg_stats_for_gen(-1, stats);
> > +}
> > +
> > +/* Read the memcg stats and optionally print if this is a debug build. */
> > +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
> > +{
> > + read_memcg_stats(stats, memcg);
> > +#ifdef DEBUG
> > + print_memcg_stats(stats, memcg);
>
> print_memcg_stats() should be static, because this is the only caller. But I'm
> guessing you made it globally visible so that the compiler wouldn't complain
> about unused functions. A better approach would be to wrap the guts with the
> #ifdef.
>
> > +#endif
> > +}
> > +
> > +/*
> > + * If lru_gen aging should force page table scanning.
> > + *
> > + * If you want to set this to false, you will need to do eviction
> > + * before doing extra aging passes.
> > + */
> > +static const bool force_scan = true;
> > +
> > +static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
> > +{
> > + FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
> > + char *command;
> > + size_t sz;
> > +
> > + TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> > + sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
> > + memcg_id, node_id, max_gen, force_scan);
> > + TEST_ASSERT(sz > 0, "creating aging command failed");
> > +
> > + pr_debug("Running aging command: %s", command);
> > + if (fwrite(command, sizeof(char), sz, f) < sz) {
> > + TEST_ASSERT(false, "writing aging command %s to %s failed",
> > + command, DEBUGFS_LRU_GEN);
> > + }
> > +
> > + TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> > +}
> > +
> > +static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
>
> Two underscores.
>
> > +/* Do aging, and print how long it took. */
> > +void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
> > +{
> > + return _lru_gen_do_aging(stats, memcg, true);
> > +}
> > +
> > +/* Do aging, don't print anything. */
> > +void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
> > +{
> > + return _lru_gen_do_aging(stats, memcg, false);
>
> static inline helpers in the header? Having to come all this way to see that
> these are simple wrappers is annoying.
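>
> E.g. in lru_gen_util.h (sketch):
>
>	static inline void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
>	{
>		__lru_gen_do_aging(stats, memcg, true);
>	}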
>
> > +}
> > +
> > +/*
> > + * Find which generation contains more than half of @total_pages, assuming that
> > + * such a generation exists.
> > + */
> > +int lru_gen_find_generation(const struct memcg_stats *stats,
> > + unsigned long total_pages)
> > +{
> > + int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
> > +
> > + for (node = 0; node < stats->nr_nodes; ++node)
>
> Curly braces.
>
> > + for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
> > + ++gen_idx) {
> > + gen = stats->nodes[node].gens[gen_idx].gen;
> > + max_gen = gen > max_gen ? gen : max_gen;
> > + min_gen = gen < min_gen ? gen : min_gen;
> > + }
> > +
> > + for (gen = min_gen; gen < max_gen; ++gen)
> > + /* See if the most pages are in this generation. */
> > + if (sum_memcg_stats_for_gen(gen, stats) >
> > + total_pages / 2)
> > + return gen;
> > +
> > + TEST_ASSERT(false, "No generation includes majority of %lu pages.",
>
> TEST_FAIL.
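>
> I.e.:
>
>	TEST_FAIL("No generation includes majority of %lu pages.", total_pages);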
>
> > + total_pages);
> > +
> > + /* unreachable, but make the compiler happy */
> > + return -1;
>
> I _think_ selftests can use unreachable().
>
> > +}
> > --
> > 2.47.0.199.ga7371fff76-goog
> >
Thanks for all the feedback. I'll apply all of it (and Yosry's
suggestion to use cgroup_util.h / otherwise de-duplicate the cgroup
logic, thanks Yosry!) for the next version of this patch. For now I'll
go ahead and send v9 of the main series.