Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap

From: Nhat Pham
Date: Thu May 30 2024 - 15:18:43 EST


On Thu, May 30, 2024 at 9:24 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, May 30, 2024 at 5:27 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > > Rather than reading/writing these pages to flash resulting
> > > in increased I/O and flash wear, a bitmap can be used to mark these
> > > pages as zero at write time, and the pages can be filled at
> > > read time if the bit corresponding to the page is set.
> > > With this patch, NVMe writes in Meta server fleet decreased
> > > by almost 10% with conventional swap setup (zswap disabled).
> > >
> > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> > >
> > > Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> >
> > This is awesome.
> >
> > > ---
> > > include/linux/swap.h | 1 +
> > > mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> > > mm/swapfile.c | 10 ++++++
> > > 3 files changed, 95 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index a11c75e897ec..e88563978441 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -299,6 +299,7 @@ struct swap_info_struct {
> > > signed char type; /* strange name for an index */
> > > unsigned int max; /* extent of the swap_map */
> > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> >
> > One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> > overhead for configured swap space. That seems reasonable for what
> > appears to be a fairly universal 10% reduction in swap IO.
> >
> > An alternative implementation would be to reserve a bit in
> > swap_map. This would be no overhead at idle, but would force
> > continuation counts earlier on heavily shared page tables, and AFAICS
> > would get complicated in terms of locking, whereas this one is pretty
> > simple (atomic ops protect the map, swapcache lock protects the bit).
> >
> > So I prefer this version. But a few comments below:
>
> I am wondering if it's even possible to take this one step further and
> avoid reclaiming zero-filled pages in the first place. Can we just
> unmap them and let the first read fault allocate a zero'd page like
> uninitialized memory, or point them at the zero page and make them
> read-only, or something? Then we could free them directly without
> going into the swap code to begin with.
>
> That's how I thought about it initially when I attempted to support
> only zero-filled pages in zswap. It could be a more complex
> implementation though.

We can aim for this eventually, but yeah the implementation will be
more complex. We'll need to be careful in handling shared zero pages,
synchronizing accesses and maintaining reference counts. I think we
will need to special-case swap cache and swap map for these zero pages
(a ghost zero swap device perhaps), or reinvent the wheel to manage
these pieces of information.

Not impossible, but annoying :) For now, I think Usama's approach is
clean enough and does the job.