RE: [PATCH v1 2/2] mm: zswap: zswap_store_pages() simplifications for batching.
From: Sridhar, Kanchana P
Date: Mon Dec 02 2024 - 20:02:01 EST
Hi Chengming, Yosry,
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Monday, December 2, 2024 11:33 AM
> To: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v1 2/2] mm: zswap: zswap_store_pages() simplifications
> for batching.
>
> On Wed, Nov 27, 2024 at 11:00 PM Chengming Zhou
> <chengming.zhou@xxxxxxxxx> wrote:
> >
> > On 2024/11/28 06:53, Kanchana P Sridhar wrote:
> > > In order to set up zswap_store_pages() to enable a clean batching
> > > implementation in [1], this patch implements the following changes:
> > >
> > > 1) Addition of zswap_alloc_entries() which will allocate zswap entries for
> > > all pages in the specified range for the folio, upfront. If this fails,
> > > we return an error status to zswap_store().
> > >
> > > > 2) Addition of zswap_compress_pages() that calls zswap_compress() for each
> > > > page, and returns false if any zswap_compress() fails, so
> > > > zswap_store_page() can cleanup resources allocated and return an error
> > > > status to zswap_store().
> > >
> > > 3) A "store_pages_failed" label that is a catch-all for all failure points
> > > in zswap_store_pages(). This facilitates cleaner error handling within
> > > zswap_store_pages(), which will become important for IAA compress
> > > batching in [1].
> > >
> > > [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=911935
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > > ---
> > > > mm/zswap.c | 93 +++++++++++++++++++++++++++++++++++++++++-------------
> > > 1 file changed, 71 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index b09d1023e775..db80c66e2205 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1409,9 +1409,56 @@ static void shrink_worker(struct work_struct
> *w)
> > > * main API
> > > **********************************/
> > >
> > > +static bool zswap_compress_pages(struct page *pages[],
> > > + struct zswap_entry *entries[],
> > > + u8 nr_pages,
> > > + struct zswap_pool *pool)
> > > +{
> > > + u8 i;
> > > +
> > > + for (i = 0; i < nr_pages; ++i) {
> > > + if (!zswap_compress(pages[i], entries[i], pool))
> > > + return false;
> > > + }
> > > +
> > > + return true;
> > > +}
> >
> > How about introducing a `zswap_compress_folio()` interface which
> > can be used by `zswap_store()`?
> > ```
> > zswap_store()
> > nr_pages = folio_nr_pages(folio)
> >
> > entries = zswap_alloc_entries(nr_pages)
> >
> > ret = zswap_compress_folio(folio, entries, pool)
> >
> > // store entries into xarray and LRU list
> > ```
> >
> > And this version `zswap_compress_folio()` is very simple for now:
> > ```
> > zswap_compress_folio()
> > nr_pages = folio_nr_pages(folio)
> >
> > for (index = 0; index < nr_pages; ++index) {
> > struct page *page = folio_page(folio, index);
> >
> > if (!zswap_compress(page, &entries[index], pool))
> > return false;
> > }
> >
> > return true;
> > ```
> > This can be easily extended to support your "batched" version.
> >
> > Then the old `zswap_store_page()` could be removed.
> >
> > The good point is simplicity, that we don't need to slice folio
> > into multiple batches, then repeat the common operations for each
> > batch, like preparing entries, storing into xarray and LRU list...
>
> +1
Thanks for the code review comments. One question though: would
it make sense to trade off memory footprint against code
simplification? For instance, let's say we want to store a 64k folio.
We would allocate memory for 16 zswap entries, and if even one of
the compressions fails, we would deallocate memory for all 16 zswap
entries. Could this lead to zswap_entry kmem_cache starvation and
subsequent zswap_store() failures in multi-process scenarios?
In other words, allocating entries in smaller batches -- more specifically,
only the compress batch size -- seems to strike a balance in terms of
memory footprint, while mitigating the starvation aspect, and possibly
also helping latency (allocating a large number of zswap entries and then
potentially deallocating them could impact latency).
If we agree on the merits of processing a large folio in smaller batches,
this in turn requires that we store each smaller batch of entries in the
xarray/LRU before moving to the next batch. This means all the
zswap_store() ops need to be done for a batch before moving to the next
batch.
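To make the trade-off concrete, here is a minimal userspace sketch, not
kernel code. All names (store_stats, simulate_store, fail_index) are
hypothetical and only model the bookkeeping: how many entries are allocated
but not yet stored at any one time, and how many already-stored entries
would have to be unwound if a later compression fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for storing one folio; not a kernel structure. */
struct store_stats {
	bool ok;                    /* true if every page "compressed" */
	unsigned int peak_unstored; /* max entries allocated but not yet stored */
	unsigned int undone;        /* entries already stored that must be unwound */
};

/*
 * Model storing nr_pages pages in batches of `batch` entries.
 * fail_index is the page whose compression fails, or -1 for no failure.
 */
static struct store_stats simulate_store(unsigned int nr_pages,
					 unsigned int batch,
					 int fail_index)
{
	struct store_stats st = { .ok = true, .peak_unstored = 0, .undone = 0 };
	unsigned int start, stored = 0;

	assert(batch > 0);

	for (start = 0; start < nr_pages; start += batch) {
		unsigned int left = nr_pages - start;
		unsigned int n = left < batch ? left : batch;

		/* "Allocate" entries for this batch only. */
		if (n > st.peak_unstored)
			st.peak_unstored = n;

		/* "Compress" the batch; one failure aborts the whole folio. */
		if (fail_index >= 0 &&
		    (unsigned int)fail_index >= start &&
		    (unsigned int)fail_index < start + n) {
			st.ok = false;
			st.undone = stored; /* earlier batches were already stored */
			return st;
		}

		/* "Store" the batch (xarray/LRU in the real code) before moving on. */
		stored += n;
	}

	return st;
}
```

For a 16-page folio whose last page fails, simulate_store(16, 16, 15)
reports 16 entries allocated and then freed without being stored, while
simulate_store(16, 8, 15) caps that at 8 but leaves 8 already-stored
entries to unwind. The sketch says nothing about actual kmem_cache
behavior or latency; it only shows where the two approaches differ.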
>
> Also, I don't like the helpers hiding some of the loops and leaving
> others, as Johannes said, please keep all the iteration over pages at
> the same function level where possible to make the code clear.
Sure. I can either inline all the loops into zswap_store_pages(), or convert
all iterations into helpers with a consistent signature:
zswap_<proc_name>(arrayed_struct, nr_pages);
Please let me know which would work best. Thanks!
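As a rough illustration of the second option, here is a userspace sketch,
not the actual patch. The sketch_* names and the sketch_batch struct are
hypothetical stand-ins; the point is only that every helper takes the same
(arrayed_struct, nr_pages) arguments and that all per-page loops sit at the
same helper level, with a single failure label in the caller.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_PAGES 16

/* Hypothetical "arrayed struct"; stands in for pages[] and entries[]. */
struct sketch_batch {
	int page_data[SKETCH_MAX_PAGES];         /* < 0 models an incompressible page */
	bool entry_allocated[SKETCH_MAX_PAGES];
	bool entry_compressed[SKETCH_MAX_PAGES];
};

static bool sketch_alloc_entries(struct sketch_batch *b, unsigned int nr_pages)
{
	for (unsigned int i = 0; i < nr_pages; ++i)
		b->entry_allocated[i] = true;
	return true;
}

static bool sketch_compress_pages(struct sketch_batch *b, unsigned int nr_pages)
{
	for (unsigned int i = 0; i < nr_pages; ++i) {
		if (b->page_data[i] < 0)
			return false;
		b->entry_compressed[i] = true;
	}
	return true;
}

static void sketch_free_entries(struct sketch_batch *b, unsigned int nr_pages)
{
	for (unsigned int i = 0; i < nr_pages; ++i) {
		b->entry_allocated[i] = false;
		b->entry_compressed[i] = false;
	}
}

/* Caller: no loops here, only helpers with a uniform signature. */
static bool sketch_store_pages(struct sketch_batch *b, unsigned int nr_pages)
{
	assert(nr_pages <= SKETCH_MAX_PAGES);

	if (!sketch_alloc_entries(b, nr_pages))
		return false;
	if (!sketch_compress_pages(b, nr_pages))
		goto store_pages_failed;
	return true;

store_pages_failed:
	sketch_free_entries(b, nr_pages);
	return false;
}
```

The inlined alternative would move the three loops directly into
sketch_store_pages(); either way the iteration stays at one level.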
>
> This should not be a separate series too, when I said divide into
> chunks I meant leave out the multiple folios batching and focus on
> batching pages in a single large folio, not breaking down the series
> into multiple ones. Not a big deal tho :)
I understand. I am trying to decouple and develop in parallel the
following, which I intend to converge into a v5 of the original series [1]:
a) Vectorization, followed by batching of zswap_store() of large folios.
b) acomp request chaining suggestions from Herbert, which could
change the existing v4 implementation of the
crypto_acomp_batch_compress() API that zswap would need to
call for IAA compress batching.
[1]: https://patchwork.kernel.org/project/linux-mm/list/?series=911935
Thanks,
Kanchana
>
> >
> > Thanks.