RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.

From: Sridhar, Kanchana P
Date: Tue Dec 03 2024 - 17:20:19 EST



> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Tuesday, December 3, 2024 1:44 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; Nhat Pham
> <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx;
> 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi,
> Kristen C <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-
> > > mm@xxxxxxxxx; hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx;
> > > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > > ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx; 21cnbao@xxxxxxxxx;
> > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>;
> > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching
> resources if
> > > the crypto_alg supports batching.
> > >
> > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > from the compressor internal fields itself, no?
> > > > >
> > > > > acomp_has_async_batching(acomp);
> > > > >
> > > > > Is this just for convenience, or is this actually an expensive thing to
> > > compute?
> > > >
> > > > Thanks for your comments. This is a good question. I tried not to imply
> that
> > > > batching resources have been allocated for the cpu based only on what
> > > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > > code ran into an -ENOMEM error on any particular cpu. In this case, I
> set
> > > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > > can be somewhat insulated from migration. I agree that this may not be
> > > > the best solution; and whether or not batching is enabled can be directly
> > > > determined just before the call to crypto_acomp_batch_compress()
> > > > based on:
> > > >
> > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > >
> > > With ahash request chaining, the idea is to accumulate as much
> > > data as you can before you provide it to the Crypto API. The
> > > API is responsible for dividing it up if the underlying driver
> > > is only able to handle one request at a time.
> > >
> > > So that would be the ideal model to use for compression as well.
> > > Provide as much data as you can and let the API handle the case
> > > where the data needs to be divided up.
> >
> > Thanks for this suggestion! This sounds like a clean way to handle the
> > batching/sequential compress/decompress within the crypto API as long
> > as it can be contained in the crypto acompress layer.
> > If the zswap maintainers don't have any objections, I can look into the
> > feasibility of doing this.
>
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

If I understand Herbert's suggestion correctly, I think what he meant was
that we allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap (say, 8)
during the cpu onlining always. The acomp_has_async_batching() API can
be used to determine whether to allocate more than one acomp_req and
crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
for the request chaining with the goal of understanding performance wrt the
existing implementation of crypto_acomp_batch_compress()).
In zswap_store_folio(), we process the large folio in batches of 8 pages
and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
discussions in this thread, it might make sense to add a bool option to
crypto_acomp_batch_compress() as follows:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
struct crypto_wait *waits[],
struct page *pages[],
u8 *dsts[],
unsigned int dlens[],
int errors[],
int nr_pages,
bool parallel);

zswap would acquire the per-cpu acomp_ctx->mutex, and pass
(acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel" parameter.
This indicates to crypto_acomp_batch_compress() whether or not
SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".

If we have multiple "reqs" (parallel == true), we use request chaining (or the
existing asynchronous poll implementation) for IAA batching. If (parallel == false),
crypto_acomp_batch_compress() will look something like this:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
struct crypto_wait *waits[],
struct page *pages[],
u8 *dsts[],
unsigned int dlens[],
int errors[],
int nr_pages,
bool parallel)
{
if (!parallel) {
struct scatterlist input, output;
int i;

for (i = 0; i < nr_pages; ++i) {
/* for pages[i], buffers[i], dlens[i]: borrow first half of
* zswap_compress() functionality:
*/
dst = acomp_ctx->buffers[i];
sg_init_table(&input, 1);
sg_set_page(&input, pages[i], PAGE_SIZE, 0);

sg_init_one(&output, dst, PAGE_SIZE * 2);
acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlens[i]);

comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), acomp_ctx->waits[0]);
dlens[i] = acomp_ctx->reqs[0]->dlen;
}
}

/*
* At this point we would have sequentially compressed the batch.
* zswap_store_folio() can process the buffers and dlens using
* common code for batching and non-batching compressors.
*/
}

IIUC, this suggestion appears to be along the lines of using common
code in zswap as far as possible, for compressors that do and do not
support batching. Herbert can correct me if I am wrong.

If this is indeed the case, the memory penalty for software compressors
would be:
1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
zswap_cpu_comp_prepare().
2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
zswap_store_folio().

This would be an additional memory penalty for what we gain by
having the common code paths in zswap for compressors that do
and do not support batching.

Thanks,
Kanchana

>
> It sounds nice on the surface, but this implies that we have to
> allocate folio_nr_pages() buffers in zswap, essentially as the
> allocation is the same size as the folio itself. While the allocation
> does not need to be contiguous, making a large number of allocations
> in the reclaim path is definitely not something we want. For a 2M THP,
> we'd need to allocate 2M in zswap_store().
>
> If we choose to keep preallocating, assuming the maximum THP size is
> 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> memory.
>
> I feel like I am missing something.
>
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Cheers,
> > > --
> > > Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > Home Page: http://gondor.apana.org.au/~herbert/
> > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt