Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

From: Mike Rapoport
Date: Mon Jan 10 2022 - 05:27:39 EST


Hi,

On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote:
> Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
>
> Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
> Compared with the size-based approach, e.g., [1], this time-based
> approach has the following advantages:
> 1) It's easier to configure because it's agnostic to applications and
> memory sizes.
> 2) It's more reliable because it's directly wired to the OOM killer.
>
> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. Compared with the page table-based approach and the PFN-based
> approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
> the following advantages:
> 1) It offers better choices because it's aware of memcgs, NUMA nodes,
> shared mappings and unmapped page cache.
> 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
> the PFN-based approach is O(nr_total_pages).
>
> Add /sys/kernel/debug/lru_gen_full for debugging.
>
> [1] https://lore.kernel.org/lkml/20211130201652.2218636d@xxxxxxxxxxxxx/
>
> Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> Tested-by: Konstantin Kharlamov <Hi-Angel@xxxxxxxxx>
> ---
> Documentation/vm/index.rst | 1 +
> Documentation/vm/multigen_lru.rst | 62 +++++

The description of user visible interfaces should go to
Documentation/admin-guide/mm

Documentation/vm/multigen_lru.rst should have contained design description
and the implementation details and it would be great to actually have such
document.

> include/linux/nodemask.h | 1 +
> mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
> 4 files changed, 479 insertions(+)
> create mode 100644 Documentation/vm/multigen_lru.rst
>
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 6f5ffef4b716..f25e755b4ff4 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
> unevictable-lru
> z3fold
> zsmalloc
> + multigen_lru
> diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> new file mode 100644
> index 000000000000..6f9e0181348b
> --- /dev/null
> +++ b/Documentation/vm/multigen_lru.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========
> +Runtime configurations
> +----------------------
> +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
> + feature wasn't enabled by default.

Required for what? This sentence seem to lack context. Maybe add an
overview what is Multigenerational LRU so that users will have an idea what
these knobs control.

> +
> +Recipes
> +=======

Some more context here will be also helpful.

> +Personal computers
> +------------------
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> + this working set can't be kept in memory. Based on the average human
> + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> + lags due to thrashing. Larger values like ``N=3000`` make lags less
> + noticeable at the cost of more OOM kills.
> +
> +Data centers
> +------------
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> + format:
> + ::
> +
> + memcg memcg_id memcg_path
> + node node_id
> + min_gen birth_time anon_size file_size
> + ...
> + max_gen birth_time anon_size file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.
> + ``anon_size`` and ``file_size`` are in pages.

And what does oldest and youngest generations mean from the user
perspective?

> +
> + This file also accepts commands in the following subsections.
> + Multiple command lines are supported, so does concatenation with
> + delimiters ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to trigger
> + the aging. It scans PTEs for accessed pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to 1 to scan for accessed anon pages
> + when swap is off. Set ``full_scan`` to 0 to reduce the overhead as
> + well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to trigger the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

...

--
Sincerely yours,
Mike.