Re: Augmented Page Reclaim

From: Matthew Wilcox
Date: Tue Feb 02 2021 - 07:19:07 EST



It's hard to know which 'note' refers to which reference. Here's
my attempt to figure that out.

On Tue, Feb 02, 2021 at 01:57:15AM -0700, Yu Zhao wrote:

> Versatility
> ===========
> Userspace can trigger aging and eviction independently via the
> ``debugfs`` interface [note]_ for working set estimation, proactive

1. `Long-term SLOs for reclaimed cloud computing resources
<https://research.google/pubs/pub43017/>`_

> reclaim, far memory tiering, NUMA-aware job scheduling, etc. The
> metrics from the interface are easily interpretable, which allows
> intuitive provisioning and discoveries like the Borg example above.
> For a warehouse-scale computer, the interface is intended to be a
> building block of a closed-loop control system, with a machine
> learning algorithm being the controller.
>
> Simplicity
> ==========
> The workflow [note]_ is well defined and each step in it has a clear

2. `Profiling a warehouse-scale computer
<https://research.google/pubs/pub44271/>`_

> meaning. There are no magic numbers or heuristics involved, only a few
> basic data structures that have a negligible memory footprint. This
> simplicity has served us well as the scale and the diversity of our
> workloads constantly grow.
[...]
> FAQ
> ===
> What is the motivation for this work?
> -------------------------------------
> In our case, DRAM is a major factor in total cost of ownership, and
> improving memory overcommit brings a high return on investment.
> Moreover, Google-Wide Profiling has been observing high CPU
> overhead [note]_ from page reclaim.

3. `Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
<https://research.google/pubs/pub48329/>`_

> Why not try to improve the existing code?
> -----------------------------------------
> We have tried but concluded the two limiting factors [note]_ in the

4. `Software-defined far memory in warehouse-scale computers
<https://research.google/pubs/pub48551/>`_

> existing code are fundamental, and therefore changes made atop them
> will not result in substantial gains on any of the aspects above.
>
> What particular workloads does it help?
> ---------------------------------------
> This work optimizes page reclaim for workloads that are not IO bound,
> because we find they are the norm on servers and clients in the cloud
> era. It would most likely help any workloads that share the common
> characteristics [note]_ we observed.

5. `Borg: the Next Generation
<https://research.google/pubs/pub49065/>`_