Re: Augmented Page Reclaim

From: Yu Zhao
Date: Tue Feb 02 2021 - 14:40:44 EST


On Tue, Feb 02, 2021 at 12:17:08PM +0000, Matthew Wilcox wrote:
>
> It's hard to know which 'note' refers to which reference. Here's
> my attempt to figure that out.

Sorry for the trouble. Every [note]_ links to

.. [note] See ``Documentation/vm/multigen-lru.rst`` in the tree.

which has nothing to do with the references listed at the bottom.

The references are helpful but not required to understand the information
in this email or the doc above.

Let me attach PDF files generated from my first email (intro.pdf) and the
doc (man.pdf). They are better formatted.

>
> On Tue, Feb 02, 2021 at 01:57:15AM -0700, Yu Zhao wrote:
>
> > Versatility
> > ===========
> > Userspace can trigger aging and eviction independently via the
> > ``debugfs`` interface [note]_ for working set estimation, proactive
>
> 1. `Long-term SLOs for reclaimed cloud computing resources
> <https://research.google/pubs/pub43017/>`_
>
> > reclaim, far memory tiering, NUMA-aware job scheduling, etc. The
> > metrics from the interface are easily interpretable, which allows
> > intuitive provisioning and discoveries like the Borg example above.
> > For a warehouse-scale computer, the interface is intended to be a
> > building block of a closed-loop control system, with a machine
> > learning algorithm being the controller.
> >
> > Simplicity
> > ==========
> > The workflow [note]_ is well defined and each step in it has a clear
>
> 2. `Profiling a warehouse-scale computer
> <https://research.google/pubs/pub44271/>`_
>
> > meaning. There are no magic numbers or heuristics involved but a few
> > basic data structures that have negligible memory footprint. This
> > simplicity has served us well as the scale and the diversity of our
> > workloads constantly grow.
> [...]
> > FAQ
> > ===
> > What is the motivation for this work?
> > -------------------------------------
> > In our case, DRAM is a major factor in total cost of ownership, and
> > improving memory overcommit brings a high return on investment.
> > Moreover, Google-Wide Profiling has been observing the high CPU
> > overhead [note]_ from page reclaim.
>
> 3. `Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
> <https://research.google/pubs/pub48329/>`_
>
> > Why not try to improve the existing code?
> > -----------------------------------------
> > We have tried but concluded the two limiting factors [note]_ in the
>
> 4. `Software-defined far memory in warehouse-scale computers
> <https://research.google/pubs/pub48551/>`_
>
> > existing code are fundamental, and therefore changes made atop them
> > will not result in substantial gains on any of the aspects above.
> >
> > What particular workloads does it help?
> > ---------------------------------------
> > This work optimizes page reclaim for workloads that are not IO bound,
> > because we find they are the norm on servers and clients in the cloud
> > era. It would most likely help any workloads that share the common
> > characteristics [note]_ we observed.
>
> 5. `Borg: the Next Generation
> <https://research.google/pubs/pub49065/>`_
>

Attachment: intro.pdf
Description: Adobe PDF document

Attachment: man.pdf
Description: Adobe PDF document