[PATCH v6 00/16] zswap IAA compress batching
From: Kanchana P Sridhar
Date: Thu Feb 06 2025 - 02:21:49 EST
IAA Compression Batching:
=========================
This patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel batch compression of pages in large folios to improve
zswap swapout latency.
Improvements seen with IAA compress batching vs. IAA sequential:
usemem30 with 64K folios:
-------------------------
59.1% higher throughput
30.3% lower elapsed time
36.2% lower sys time
usemem30 with 2M folios:
------------------------
60.2% higher throughput
26.7% lower elapsed time
30.5% lower sys time
There is no performance impact to zstd with v6.
The major focus for v6 was to fix the performance regressions observed in
v5, highlighted by Yosry (Thanks Yosry):
1) zstd performance regression.
2) IAA batching vs. IAA non-batching regression.
The patch-series is organized as follows:
1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
patches are tagged with "crypto:" in the subject:
Patch 1) Adds a new acomp request chaining framework and interface based
on Herbert Xu's ahash reference implementation in "[PATCH 2/6]
crypto: hash - Add request chaining API" [1]. acomp algorithms
can use request chaining through these interfaces:
Setup the request chain:
acomp_reqchain_init()
acomp_request_chain()
Process the request chain:
acomp_do_req_chain(): synchronously (sequentially)
acomp_do_async_req_chain(): asynchronously using submit/poll
ops (in parallel)
Patch 2) Adds acomp_alg/crypto_acomp interfaces for batch_compress(),
batch_decompress() and get_batch_size(), which swap modules can
invoke using the new batching APIs crypto_acomp_batch_compress(),
crypto_acomp_batch_decompress() and crypto_acomp_batch_size()
(see the usage sketch at the end of this section).
Additionally, crypto acomp provides a new
acomp_has_async_batching() interface to query for these APIs
before allocating batching resources for a given compressor in
zswap/zram.
Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
async poll mode in iaa_crypto.
Patch 4) Adds iaa_crypto driver implementations for sync/async
crypto_acomp_batch_compress() and
crypto_acomp_batch_decompress() developed using request
chaining. If the iaa_crypto driver is set up for 'async'
sync_mode, these batching implementations deploy the
asynchronous request chaining implementation. 'async' is the
recommended mode for realizing the benefits of IAA parallelism.
If iaa_crypto is set up for 'sync' sync_mode, the synchronous
version of the request chaining API is used.
The "iaa_acomp_fixed_deflate" algorithm registers these
implementations for its "batch_compress" and "batch_decompress"
interfaces respectively and opts in with CRYPTO_ALG_REQ_CHAIN.
Further, iaa_crypto provides an implementation for the
"get_batch_size" interface: this returns the
IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines
currently as 8U for IAA compression algorithms (iaa_crypto can
change this if needed as we optimize our batching algorithm).
Patch 5) Modifies the default iaa_crypto driver mode to async, now that
iaa_crypto provides a truly async mode that gives
significantly better latency than sync mode for the batching
use case.
Patch 6) Disables verify_compress by default, to make it easier for
users to run IAA for comparison with software compressors.
Patch 7) Reorganizes the iaa_crypto driver code into logically related
sections and avoids forward declarations, in order to facilitate
Patch 8. This patch makes no functional changes.
Patch 8) Makes a major infrastructure change in the iaa_crypto driver,
to map IAA devices/work-queues to cores based on packages
instead of NUMA nodes. This doesn't impact performance on
the Sapphire Rapids system used for performance
testing. However, this change fixes functional problems we
found on Granite Rapids in internal validation, where the
number of NUMA nodes is greater than the number of packages:
with the existing NUMA-based mapping, some IAA devices were
over-utilized while others went unused.
This patch also eliminates duplication of device wqs in
per-cpu wq_tables, thereby saving 140MiB on a 384-core
Granite Rapids server with 8 IAAs. I am submitting this change
now so that it can go through code review before being merged.
Patch 9) Builds upon the new infrastructure for mapping IAAs to cores
based on packages, and enables configuring a "global_wq" per
IAA, which can be used as a global resource for compress jobs
for the package. If the user configures 2WQs per IAA device,
the driver will distribute compress jobs from all cores on the
package to the "global_wqs" of all the IAA devices on that
package, in a round-robin manner. This can be used to improve
compression throughput for workloads that see a lot of swapout
activity.
Patch 10) Makes an important change to iaa_crypto driver's descriptor
allocation, from blocking to non-blocking with
retries/timeouts and mitigations in case of timeouts during
compress/decompress ops. This prevents tasks from getting blocked
indefinitely, which was observed when running workloads on 30
cores with only 1 of the 4 IAA devices enabled on Sapphire
Rapids. These timeouts, and the associated mitigations, are
typically encountered only in configurations where 1 IAA device
is shared by 30+ cores.
Patch 11) Fixes a bug with the "deflate_generic_tfm" global being
accessed without locks in the software decomp fallback code.
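To make the intended usage concrete, below is a minimal, hypothetical
sketch of how a swap module might use the batching interfaces described
above. Only the function names (acomp_has_async_batching(),
crypto_acomp_batch_size(), crypto_acomp_batch_compress()) come from this
series; the parameter lists, return-value conventions and the caller shown
here are illustrative assumptions, not the actual signatures added in
Patches 2 and 4:

/*
 * Hypothetical caller of the crypto_acomp batching API. Illustrative
 * only: parameter lists and return values are assumptions; the
 * authoritative definitions are in Patches 2 and 4.
 */
#include <linux/minmax.h>
#include <linux/mm_types.h>
#include <crypto/acompress.h>

static int example_batch_compress(struct crypto_acomp *tfm,
                                  struct acomp_req *reqs[],
                                  struct page *pages[],
                                  u8 *dst_bufs[],
                                  unsigned int dlens[],
                                  int errors[],
                                  int nr_pages)
{
        /*
         * Compressors that don't register batch_compress()/
         * batch_decompress() are handled with the regular
         * one-page-at-a-time crypto_acomp_compress() path.
         */
        if (!acomp_has_async_batching(tfm))
                return -EOPNOTSUPP;

        /* The driver advertises its batch size, e.g. 8 for deflate-iaa. */
        nr_pages = min_t(int, nr_pages, crypto_acomp_batch_size(tfm));

        /*
         * Passing NULL for @wait makes iaa_crypto use asynchronous
         * polling rather than async request chaining (see the note
         * under Patch 15 below).
         */
        if (!crypto_acomp_batch_compress(reqs, NULL, pages, dst_bufs,
                                         dlens, errors, nr_pages))
                return -EINVAL;

        return 0;
}

The batch size query lets zswap/zram size their per-CPU request/buffer
arrays once, at pool creation/CPU onlining time, rather than on every
store.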
2) zswap modifications to enable compress batching in zswap_store()
of large folios (including pmd-mappable folios):
Patch 12) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
as 8U) to denote the maximum number of acomp_ctx batching
resources. Further, the "struct crypto_acomp_ctx" is modified
to contain a configurable number of acomp_reqs and buffers.
The cpu hotplug onlining code will allocate up to
ZSWAP_MAX_BATCH_SIZE requests/buffers in the per-cpu
acomp_ctx, thereby limiting the memory usage in zswap, and
ensuring that non-batching compressors incur no memory penalty.
Patch 13) Restructures & simplifies zswap_store() to make it amenable
for batching. Moves the loop over the folio's pages to a new
zswap_store_folio(), which in turn allocates zswap entries
for all folio pages upfront, then calls zswap_compress() for
each folio page.
Patch 14) Introduces zswap_compress_folio() to compress all pages in a
folio.
Patch 15) Modifies zswap_compress_folio() to detect if the compressor
supports batching. If so, the "acomp_ctx->nr_reqs" becomes the
batch size with which we call crypto_acomp_batch_compress() to
compress multiple folio pages in parallel on the IAA. Upon
successful compression of a batch, the compressed buffers are
stored in the zpool (see the sketch at the end of this section).
For compressors that don't support batching,
zswap_compress_folio() calls zswap_compress() for each
page in the folio.
Although this gives significantly better IAA batching
performance/throughput, it also introduced a significant
performance regression with zstd/2M folios, which is
fixed in Patch 16.
Based on the discussions in [2], patch 15 invokes
crypto_acomp_batch_compress() with "NULL" for the @wait
parameter. This causes iaa_crypto's iaa_comp_acompress_batch()
to use asynchronous polling instead of async request chaining
for now, until there is better clarity on request
chaining. Further, testing with micro-benchmarks indicated a
slight increase in latency with request chaining:
crypto_acomp_batch_compress()     p05 (ns)    p50 (ns)    p99 (ns)
------------------------------------------------------------------
async polling                        5,279       5,589       8,875
async request chaining               5,316       5,662       8,923
------------------------------------------------------------------
Patch 16) Fixes the zstd 2M regression. With this patch, there are no
regressions with zstd, and impressive throughput/performance
improvements with IAA batching vs. no-batching.
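For reference, here is a rough sketch of the zswap_compress_folio()
batching flow described in Patches 12-15. The function and field names
(zswap_compress_folio(), zswap_batch_compress(), zswap_compress(),
acomp_ctx->nr_reqs, ZSWAP_MAX_BATCH_SIZE) follow this cover letter; the
exact signatures, locking and error handling are simplified assumptions
rather than the code in the patches:

/*
 * Simplified sketch of the batching branch in zswap_compress_folio()
 * (mm/zswap.c style). Signatures, locking and error handling are
 * illustrative assumptions; see Patches 12-16 for the real code.
 */
static bool zswap_compress_folio(struct folio *folio,
                                 struct zswap_entry *entries[],
                                 struct zswap_pool *pool)
{
        long nr_pages = folio_nr_pages(folio);
        struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
        long index;

        /*
         * For a batching compressor (e.g. "deflate-iaa"), the CPU
         * onlining code allocated acomp_ctx->nr_reqs requests/buffers
         * per-CPU, up to ZSWAP_MAX_BATCH_SIZE; non-batching compressors
         * get nr_reqs == 1.
         */
        if (acomp_ctx->nr_reqs > 1) {
                for (index = 0; index < nr_pages;
                     index += acomp_ctx->nr_reqs) {
                        long batch = min(nr_pages - index,
                                         (long)acomp_ctx->nr_reqs);

                        /*
                         * Compress up to nr_reqs pages in parallel on
                         * the IAA, then store the compressed buffers in
                         * the zpool.
                         */
                        if (!zswap_batch_compress(folio, index, batch,
                                                  &entries[index], pool,
                                                  acomp_ctx))
                                return false;
                }
                return true;
        }

        /* Non-batching compressors: compress one page at a time. */
        for (index = 0; index < nr_pages; index++) {
                if (!zswap_compress(folio_page(folio, index),
                                    entries[index], pool))
                        return false;
        }

        return true;
}

Patch 16 keeps this split but restructures the batching path so that its
per-page behavior matches the earlier sequential zswap_store_page()-style
iteration, which is what resolves the zstd 2M regression described below.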
With v6 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute.
[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@xxxxxxxxxxxxxxxxxxx/
[2]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-3-kanchana.p.sridhar@xxxxxxxxx/
System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 2-1-2025,
commit 7de6fd8ab650, with and without this patch-series.
Data was gathered on a dual-socket Intel Sapphire Rapids (SPR) server
with 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM, and a
525G SSD swap partition. Core frequency was fixed at 2500MHz.
Other kernel configuration parameters:
zswap compressor : zstd, deflate-iaa
zswap allocator : zsmalloc
vm.page-cluster : 0
IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).
I ran experiments with these workloads:
1) usemem with 30 processes, with these large folio sizes enabled to
"always":
- 64k
- 2048k
2) Kernel compilation of allmodconfig with 2G max memory, 32 threads, run
in tmpfs, with this large folio size enabled to "always":
- 64k
usemem30 and kernel compilation used different IAA WQs configurations:
usemem30 IAA WQ configuration:
------------------------------
1 WQ with 128 entries per device. Compress/decompress jobs are sent to
the same WQ and IAA that is mapped to the cores. There is very little
swapin activity in this workload, and allocating 2 WQs (one for decomps,
one for comps, each with 64 entries) degrades compress batching latency.
This difference in IAA WQ configuration explains the insignificant
performance gains seen with IAA batching in v5; with 1 WQ per device,
batching once again delivers the expected performance improvements.
Kernel compilation IAA WQ configuration:
----------------------------------------
2 WQs, with 64 entries each, are configured per IAA device. Compress jobs
from all cores on a socket are distributed among all 4 IAA devices on the
same socket.
Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:
usemem --init-time -w -O -b 1 -s 10 -n 30 10g
One important difference in v6's experiments is that the 30 usemem
processes are pinned to 30 consecutive cores on socket 0, which makes use
of the IAA devices only on socket 0.
64K folios: usemem30: deflate-iaa:
==================================
-------------------------------------------------------------------------------
                         mm-unstable-2-1-2025           v6            v6
-------------------------------------------------------------------------------
zswap compressor                  deflate-iaa  deflate-iaa   deflate-iaa   IAA Batching
                                                                               vs.
                                                                           Sequential
-------------------------------------------------------------------------------
Total throughput (KB/s)             6,039,595    9,679,965     9,537,327    59.1%
Avg throughput (KB/s)                 201,319      322,665       317,910
elapsed time (sec)                     100.74        69.05         71.43   -30.3%
sys time (sec)                       2,446.53     1,526.71      1,596.23   -36.2%
-------------------------------------------------------------------------------
memcg_high                            909,501      961,527       964,010
memcg_swap_fail                         1,580          733         2,393
zswpout                            58,342,295   61,542,432    61,715,737
zswpin                                    425           80           442
pswpout                                     0            0             0
pswpin                                      0            0             0
thp_swpout                                  0            0             0
thp_swpout_fallback                         0            0             0
64kB_swpout_fallback                    1,580          733         2,393
pgmajfault                              3,311        2,860         3,220
anon_fault_alloc_64kB               4,924,571    4,924,545     4,924,104
ZSWPOUT-64kB                        3,644,769    3,845,652     3,854,791
SWPOUT-64kB                                 0            0             0
-------------------------------------------------------------------------------
2M folios: usemem30: deflate-iaa:
=================================
-------------------------------------------------------------------------------
                         mm-unstable-2-1-2025           v6            v6
-------------------------------------------------------------------------------
zswap compressor                  deflate-iaa  deflate-iaa   deflate-iaa   IAA Batching
                                                                               vs.
                                                                           Sequential
-------------------------------------------------------------------------------
Total throughput (KB/s)             6,334,585   10,068,264    10,230,633    60.2%
Avg throughput (KB/s)                 211,152      335,608       341,021
elapsed time (sec)                      87.68        65.74         62.86   -26.7%
sys time (sec)                       2,031.84     1,454.93      1,370.87   -30.5%
-------------------------------------------------------------------------------
memcg_high                            115,322      121,226       120,093
memcg_swap_fail                           568          350           301
zswpout                           559,323,303   62,474,427    61,907,590
zswpin                                    518          463            14
pswpout                                     0            0             0
pswpin                                      0            0             0
thp_swpout                                  0            0             0
thp_swpout_fallback                       568          350           301
pgmajfault                              3,298        3,247         2,826
anon_fault_alloc_2048kB               153,734      153,734       153,737
ZSWPOUT-2048kB                        115,321      121,672       120,614
SWPOUT-2048kB                               0            0             0
-------------------------------------------------------------------------------
64K folios: usemem30: zstd:
===========================
-------------------------------------------------------------------------------
                         mm-unstable-2-1-2025           v6           v6           v6
                                                    Patch 15     Patch 16     Patch 16
                                                      (regr)      (fixed)      (fixed)
-------------------------------------------------------------------------------
zswap compressor                         zstd         zstd         zstd         zstd
-------------------------------------------------------------------------------
Total throughput (KB/s)             6,929,741    6,975,265    7,003,546    6,953,025
Avg throughput (KB/s)                 230,991      232,508      233,451      231,767
elapsed time (sec)                      88.59        87.32        87.45        88.57
sys time (sec)                       2,188.83     2,136.52     2,133.41     2,178.23
-------------------------------------------------------------------------------
memcg_high                            764,423      764,174      764,420      764,476
memcg_swap_fail                         1,236           15        1,234           16
zswpout                            48,928,758   48,908,998   48,928,536   48,928,551
zswpin                                    421           68          396          100
pswpout                                     0            0            0            0
pswpin                                      0            0            0            0
thp_swpout                                  0            0            0            0
thp_swpout_fallback                         0            0            0            0
64kB_swpout_fallback                    1,236           15        1,234           16
pgmajfault                              3,196        2,875        3,570        3,284
anon_fault_alloc_64kB               4,924,288    4,924,406    4,924,161    4,924,064
ZSWPOUT-64kB                        3,056,753    3,056,772    3,056,745    3,057,979
SWPOUT-64kB                                 0            0            0            0
-------------------------------------------------------------------------------
2M folios: usemem30: zstd:
==========================
-------------------------------------------------------------------------------
                         mm-unstable-2-1-2025           v6           v6           v6
                                                    Patch 15     Patch 16     Patch 16
                                                      (regr)      (fixed)      (fixed)
-------------------------------------------------------------------------------
zswap compressor                         zstd         zstd         zstd         zstd
-------------------------------------------------------------------------------
Total throughput (KB/s)             7,712,462    7,235,682    7,716,994    7,745,378
Avg throughput (KB/s)                 257,082      241,189      257,233      258,179
elapsed time (sec)                      84.94        89.54        86.96        85.82
sys time (sec)                       2,008.19     2,141.90     2,059.80     2,039.96
-------------------------------------------------------------------------------
memcg_high                             93,036       94,792       93,137       93,100
memcg_swap_fail                           143          169           32           11
zswpout                            48,062,240   48,929,604   48,113,722   48,073,739
zswpin                                    439          438           71            9
pswpout                                     0            0            0            0
pswpin                                      0            0            0            0
thp_swpout                                  0            0            0            0
thp_swpout_fallback                       143          169           32           11
pgmajfault                              3,246        3,645        3,248        2,775
anon_fault_alloc_2048kB               153,739      153,738      153,740      153,733
ZSWPOUT-2048kB                         93,726       95,398       93,940       93,883
SWPOUT-2048kB                               0            0            0            0
-------------------------------------------------------------------------------
zstd 2M regression fix details:
-------------------------------
Patch 16 essentially adapts the batching implementation of
zswap_store_folio() to be sequential, i.e., the behavior becomes the same
as the earlier zswap_store_page() iteration over the folio's pages, while
preserving common code paths.
I wasn't able to quantify why the Patch 15 implementation caused the
zstd regression using the usual profiling methods such as
tracepoints/bpftrace. My best hypothesis as to why Patch 16 resolves the
regression is that it comes down to a combination of branch mispredicts
and the working set of the zswap_store_folio() code blocks, which have to
load and iterate over 512 pages in the 3 loops. I gathered perf event
counts that seem to back up this hypothesis:
-------------------------------------------------------------------------------
usemem30, zstd,               v6 Patch 15           v6 Patch 16        Change in
2M Folios,                 zstd 2M regression       Fixes zstd         PMU events
perf stats                                          2M regression      with fix
-------------------------------------------------------------------------------
branch-misses                 1,571,192,128        1,545,342,571       -25,849,557
L1-dcache-stores          1,211,615,528,323    1,190,695,049,961   -20,920,478,362
L1-dcache-loads           3,357,273,843,074    3,307,817,975,881   -49,455,867,193
LLC-store-misses              3,357,428,475        3,340,252,023       -17,176,452
branch-load-misses            1,567,824,197        1,546,321,034       -21,503,163
branch-loads              1,463,632,526,371    1,449,551,102,173   -14,081,424,198
mem-stores                1,211,399,592,024    1,191,473,855,029   -19,925,736,995
dTLB-loads                3,367,449,558,533    3,308,475,712,698   -58,973,845,835
LLC-loads                     1,867,235,354        1,773,790,017       -93,445,337
node-load-misses                  4,057,323            3,959,741           -97,582
major-faults                            241                    0              -241
L1-dcache-load-misses        22,339,515,994       24,381,783,235     2,042,267,241
L1-icache-load-misses        21,182,690,283       26,504,876,405     5,322,186,122
LLC-load-misses                 224,000,082          258,495,328        34,495,246
node-loads                      221,425,627          256,372,686        34,947,059
mem-loads                                 0                    0                 0
dTLB-load-misses                  4,886,686            8,672,079         3,785,393
iTLB-load-misses                  1,548,637            4,268,093         2,719,456
cache-misses                 10,831,533,095       10,834,598,425         3,065,330
minor-faults                        155,246              155,707               461
-------------------------------------------------------------------------------
4K folios: usemem30: Regression testing:
========================================
-------------------------------------------------------------------------------
                           mm-unstable           v6  mm-unstable           v6
-------------------------------------------------------------------------------
zswap compressor           deflate-iaa  deflate-iaa         zstd         zstd
-------------------------------------------------------------------------------
Total throughput (KB/s)      5,155,471    6,031,332    6,453,431    6,566,026
Avg throughput (KB/s)          171,849      201,044      215,114      218,867
elapsed time (sec)              108.35        92.61        95.50        88.99
sys time (sec)                2,400.32     2,212.06     2,417.16     2,207.35
-------------------------------------------------------------------------------
memcg_high                     670,635    1,007,763      764,456      764,470
memcg_swap_fail                      0            0            0            0
zswpout                     62,098,929   64,507,508   48,928,772   48,928,690
zswpin                             425           77          457          461
pswpout                              0            0            0            0
pswpin                               0            0            0            0
thp_swpout                           0            0            0            0
thp_swpout_fallback                  0            0            0            0
pgmajfault                       3,271        2,864        3,240        3,242
-------------------------------------------------------------------------------
Performance testing (Kernel compilation, allmodconfig):
=======================================================
The kernel compilation experiments use "allmodconfig" with 32 threads,
built in tmpfs; the build takes ~12 minutes and has considerable
swapout/swapin activity. The cgroup's memory.max is set to 2G.
64K folios: Kernel compilation/allmodconfig:
============================================
-------------------------------------------------------------------------------
                           mm-unstable           v6  mm-unstable           v6
-------------------------------------------------------------------------------
zswap compressor           deflate-iaa  deflate-iaa         zstd         zstd
-------------------------------------------------------------------------------
real_sec                        767.36       743.90       776.08       769.43
user_sec                     15,773.57    15,773.34    15,780.93    15,736.49
sys_sec                       4,209.63     4,013.51     5,392.85     5,046.05
-------------------------------------------------------------------------------
Max_Res_Set_Size_KB          1,874,680    1,873,776    1,874,244    1,873,456
-------------------------------------------------------------------------------
memcg_high                           0            0            0            0
memcg_swap_fail                      0            0            0            0
zswpout                    109,623,799  110,737,958   89,488,777   81,553,126
zswpin                      33,303,441   33,295,883   26,753,716   23,266,542
pswpout                            315          151           99          116
pswpin                              80           54           64           32
thp_swpout                           0            0            0            0
thp_swpout_fallback                  0            0            0            0
64kB_swpout_fallback                 0          348            0            0
pgmajfault                  35,606,216   35,462,017   28,488,538   24,703,903
ZSWPOUT-64kB                 3,551,578    3,596,675    2,814,435    2,603,649
SWPOUT-64kB                         19            5            5            7
-------------------------------------------------------------------------------
With the iaa_crypto driver changes for non-blocking descriptor
allocations, no timeouts-with-mitigations were seen in compress/decompress
jobs for any of the above experiments.
Summary:
========
The performance testing data with the usemem 30-process and kernel
compilation tests shows 60% throughput gains and 36% sys time reduction
(usemem30), and 5% sys time reduction (kernel compilation), when
zswap_store() of large folios uses IAA compress batching as compared to
IAA sequential. There is no performance regression for zstd.
We can expect even more significant performance and throughput
improvements if we use the parallelism offered by IAA to do reclaim
batching of 4K/large folios (really any-order folios), and use the
zswap_store() high-throughput compression to batch-compress the pages
comprising these folios, rather than batching only within large folios.
This is the reclaim batching patch 13 in v1, which will be submitted in a
separate patch-series.
Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72
cores for ~25 minutes under stringent memory limit constraints, has shown
up to 50% reduction in sys time and 21.3% more memory savings with IAA as
compared to zstd, for the same performance. IAA batching demonstrates more
than 2X the memory savings obtained by zstd for the same performance.
Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):
2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
of IAA batching against zstd.
5) Commit logs/zswap comments not mentioning crypto_acomp implementation
details.
6) Delete the pr_info_once() when batching resources are allocated in
zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio(), where an xa_store() failure can cause
all compressed objects and entries to be freed, followed by a UAF when
zswap_store() tries to free the entries that were already added to the
xarray prior to the failure.
11) Deleted compressed_bytes/bytes. zswap_store_folio() also incorporates
the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
when zswap_store_page() fails") by Hyeonggon Yoo.
iaa_crypto improvements/fixes/changes:
12) Enables asynchronous mode and makes it the default. Since commit
4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
sync_mode is set to 'async'"), 'async' mode was effectively just sync;
we now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
timeouts, and mitigations for compress/decompress ops that fail to
obtain a descriptor. This fixes "task blocked" errors seen in
configurations where 30+ cores are running workloads under high memory
pressure and sending comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
deflate_generic_decompress(), which can cause data corruption and
zswap_decompress() kernel crash.
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
request chaining for slightly better latency. However, the request
chaining framework itself is unchanged, preserved from v5.
Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
Nhat, thanks to all!
- New zswap_compress_folio() that is called by zswap_store().
- Move the loop over folio's pages out of zswap_store() and into a
zswap_store_folio() that stores all pages.
- Allocate all zswap entries for the folio upfront.
- Added zswap_batch_compress().
- Branch to call zswap_compress() or zswap_batch_compress() inside
zswap_compress_folio().
- All iterations over pages kept in same function level.
- No helpers other than the newly added zswap_store_folio() and
zswap_compress_folio().
Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
zswap/zram to query if a crypto_acomp has registered batch_compress and
batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
iaa_comp_a[de]compress_batch() so that a module like zswap can be
confident about the acomp_reqs[0] not having the poll bit set before
calling the fully synchronous API crypto_acomp_[de]compress().
Herbert, I would appreciate it if you could review changes 2-4, in patches
1-8 of v4. I did not want to introduce too many iaa_crypto changes in
v4, given that patch 7 is already making a major change. I plan to work
on incorporating the request chaining using the ahash interface in v5
(I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
cpu hotplug onlining code, since there is no longer a sysctl to control
batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
sequence of events between zswap_store() and zswap_batch_store() similar
as much as possible for readability and control flow, better naming of
procedures, avoiding forward declarations, not inlining error path
procedures, deleting zswap internal details from zswap.h, etc. Thanks
Johannes, really appreciate the direction!
I have tried to explain the minimal future-proofing in terms of the
zswap_batch_store() signature and the definition of "struct
zswap_batch_store_sub_batch" in the comments for this struct. I hope the
new code explains the control flow a bit better.
Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
returned by kmalloc_node() for acomp_ctx->buffers and for
acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
the per-cpu acomp_batch_ctx tests true for batching resources having
been allocated on this cpu. Also, changed from per_cpu_ptr() to
raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
async/poll mode, and to encapsulate the polling functionality in the
iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
API in iaa_crypto and to make its use seamless from zswap's
perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
to enable compress batching, while minimizing the memory footprint
cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
reclaim batching patch from this series, since it requires a broader
discussion.
I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!
Thanks,
Kanchana
Kanchana P Sridhar (16):
crypto: acomp - Add synchronous/asynchronous acomp request chaining.
crypto: acomp - Define new interfaces for compress/decompress
batching.
crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
async mode.
crypto: iaa - Implement batch_compress(), batch_decompress() API in
iaa_crypto.
crypto: iaa - Enable async mode and make it the default.
crypto: iaa - Disable iaa_verify_compress by default.
crypto: iaa - Re-organize the iaa_crypto driver code.
crypto: iaa - Map IAA devices/wqs to cores based on packages instead
of NUMA.
crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
package.
crypto: iaa - Descriptor allocation timeouts with mitigations in
iaa_crypto.
crypto: iaa - Fix for "deflate_generic_tfm" global being accessed
without locks.
mm: zswap: Allocate pool batching resources if the compressor supports
batching.
mm: zswap: Restructure & simplify zswap_store() to make it amenable
for batching.
mm: zswap: Introduce zswap_compress_folio() to compress all pages in a
folio.
mm: zswap: Compress batching with Intel IAA in zswap_store() of large
folios.
mm: zswap: Fix for zstd performance regression with 2M folios.
.../driver-api/crypto/iaa/iaa-crypto.rst | 11 +-
crypto/acompress.c | 287 +++
drivers/crypto/intel/iaa/iaa_crypto.h | 30 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 1724 ++++++++++++-----
include/crypto/acompress.h | 157 ++
include/crypto/algapi.h | 10 +
include/crypto/internal/acompress.h | 29 +
include/linux/crypto.h | 31 +
mm/zswap.c | 449 ++++-
9 files changed, 2170 insertions(+), 558 deletions(-)
base-commit: 7de6fd8ab65003f050aa58e705592745717ed318
--
2.27.0