RE: [RFC PATCH v1 00/13] zswap IAA compress batching

From: Sridhar, Kanchana P
Date: Wed Oct 23 2024 - 16:34:25 EST



> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, October 23, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> linux-crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; zanussi@xxxxxxxxxx; viro@xxxxxxxxxxxxxxxxxx;
> brauner@xxxxxxxxxx; jack@xxxxxxx; mcgrof@xxxxxxxxxx; kees@xxxxxxxxxx;
> joel.granados@xxxxxxxxxx; bfoster@xxxxxxxxxx; willy@xxxxxxxxxxxxx; linux-
> fsdevel@xxxxxxxxxxxxxxx; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal,
> Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
>
> On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > > >
> > > >
> > > > IAA Compression Batching:
> > > > =========================
> > > >
> > > > This RFC patch-series introduces the use of the Intel In-Memory
> > > > Analytics Accelerator (IAA) for parallel compression of pages in a
> > > > folio, and for batched reclaim of hybrid, any-order batches of folios
> > > > in shrink_folio_list().
> > > >
> > > > The patch-series is organized as follows:
> > > >
> > > > 1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > > > with "crypto:" in the subject:
> > > >
> > > > a) async poll crypto_acomp interface without interrupts.
> > > > b) crypto testmgr acomp poll support.
> > > > c) Modifying the default sync_mode to "async" and disabling
> > > > verify_compress by default, to make it easy for users to run IAA
> > > > for comparison with software compressors.
> > > > d) Changing the cpu-to-iaa mappings to more evenly balance cores to
> > > > IAA devices.
> > > > e) Addition of a "global_wq" per IAA, which can be used as a global
> > > > resource for the socket. If the user configures 2 WQs per IAA device,
> > > > the driver will distribute compress jobs from all cores on the
> > > > socket to the "global_wqs" of all the IAA devices on that socket, in
> > > > a round-robin manner. This can be used to improve compression
> > > > throughput for workloads that see a lot of swapout activity.
> > > >
> > > > 2) Migrating zswap to use async poll in
> > > > zswap_compress()/zswap_decompress().
> > > > 3) A centralized batch compression API that can be used by swap
> > > > modules.
> > > > 4) IAA compress batching within large folio zswap stores.
> > > > 5) IAA compress batching of any-order hybrid folios in
> > > > shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > > > parameter can be used to configure the number of folios in [1, 32] to
> > > > be reclaimed using compress batching.
> > >
> > > I am still digesting this series but I have some high level questions
> > > that I left on some patches. My intuition though is that we should
> > > drop (5) from the initial proposal as it's most controversial.
> > > Batching reclaim of unrelated folios through zswap *might* make sense,
> > > but it needs a broader conversation and it needs justification on its
> > > own merit, without the rest of the series.
> >
> > Thanks for these suggestions! Sure, I can drop (5) from the initial patch-set.
> > Agreed, this also needs a broader discussion.
> >
> > I believe the 4K-folio usemem30 data in this patchset demonstrates the
> > benefits of batching reclaim, and justifies it on its own merit. I added
> > the data on batching reclaim with kernel compilation as part of the
> > 4K-folio experiments in the IAA decompression batching patch-series [1].
> > Listing it here as well; I will make sure to add this data in subsequent revs.
> >
> > --------------------------------------------------------------------------
> > Kernel compilation in tmpfs/allmodconfig, 2G max memory:
> >
> > No large folios          mm-unstable-10-16-2024      shrink_folio_list()
> >                                                      batching of folios
> > --------------------------------------------------------------------------
> > zswap compressor         zstd         deflate-iaa    deflate-iaa
> > vm.compress-batchsize    n/a          n/a            32
> > vm.page-cluster          3            3              3
> > --------------------------------------------------------------------------
> > real_sec                 783.87       761.69         747.32
> > user_sec                 15,750.07    15,716.69      15,728.39
> > sys_sec                  6,522.32     5,725.28       5,399.44
> > Max_RSS_KB               1,872,640    1,870,848      1,874,432
> >
> > zswpout                  82,364,991   97,739,600     102,780,612
> > zswpin                   21,303,393   27,684,166     29,016,252
> > pswpout                  13           222            213
> > pswpin                   12           209            202
> > pgmajfault               17,114,339   22,421,211     23,378,161
> > swap_ra                  4,596,035    5,840,082      6,231,646
> > swap_ra_hit              2,903,249    3,682,444      3,940,420
> > --------------------------------------------------------------------------
> >
> > The performance improvements seen do depend on compression batching in
> > the swap modules (zswap). The implementation in patch 12 of the compress
> > batching series sets up this zswap compression pipeline, which takes an
> > array of folios and processes them in batches of 8 pages compressed in
> > parallel in hardware. That said, we do see latency improvements even when
> > reclaim batching is combined with zswap compress batching using
> > zstd/lzo-rle/etc. I haven't done a lot of analysis of this, but I am
> > guessing that fewer calls from the swap layer (swap_writepage()) into
> > zswap could have something to do with it. If we believe that batching can
> > be the right thing to do even for the software compressors, I can gather
> > batching data with zstd for v2.
>
> Thanks for sharing the data. What I meant is, I think we should focus
> on supporting large folio compression batching for this series, and
> only present figures for this support to avoid confusion.
>
> Once this lands, we can discuss support for batching the compression
> of different unrelated folios separately, as it spans areas beyond
> just zswap and will need broader discussion.

Absolutely, this makes sense, thanks Yosry! I will address this in v2.

Thanks,
Kanchana