Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap

From: Yosry Ahmed
Date: Thu May 30 2024 - 12:20:58 EST


On Thu, May 30, 2024 at 3:21 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).

Great work. I thought about doing this after my attempt to drop some
of the same-filled pages handling in zswap, now we can drop all of it
:)

Make sure to CC other non-zswap folks on next iterations like Ying. I
see Johannes already did that in his response. I suspect
get_maintainers will give you a few extra names.

>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> ---
> include/linux/swap.h | 1 +
> mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> mm/swapfile.c | 10 ++++++
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct swap_cluster_list free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..ab043b4ad577 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,77 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *page;

I just recently renamed this variable in the zswap version of this
function because it is very confusing, especially when looking for
struct page references. 'page' is usually used for struct page. Let's
use a different name here.

> + unsigned int pos;
> + bool ret = false;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);

In the zswap version, we compare against the end of the page first
before the loop, in case the page just has a bunch of zeros at its
beginning. The patch that added it to zswap reported better
performance in some cases [1].

You can also use memchr_inv() to compare the range against 0 instead
of the loop.

[1]https://lore.kernel.org/all/20230205190036.1730134-1-taejoon.song@xxxxxxx/

> + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) {
> + if (page[pos] != 0)
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(page);
> + return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + if (!is_folio_page_zero_filled(folio, i))
> + return false;
> + }
> + return true;
> +}

Is there value in splitting this into two functions and having a
separate loop here? Can we just have a single function that kmaps the
entire folio and loops over it in one go?

> +
> +static void folio_page_zero_fill(struct folio *folio, int i)
> +{
> + unsigned long *page;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
> + memset_l(page, 0, PAGE_SIZE / sizeof(unsigned long));
> + kunmap_local(page);
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++)
> + folio_page_zero_fill(folio, i);

I think you can just use clear_highpage() here and drop
folio_page_zero_fill(). It should be more optimized as well in some
cases.

I don't have further comments about the rest beyond Johannes's comments.