Re: [PATCH v3] page_io: zswap: do not crash the kernel on decompression failure

From: Yosry Ahmed
Date: Mon Mar 03 2025 - 16:21:56 EST


On Mon, Mar 03, 2025 at 12:06:27PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.
>
> See [1] for a recent upstream discussion about this.
>
> The zswap writeback case is relatively straightforward to fix. For the
> zswap_load() case, we reorganize the swap read paths, having
> swap_read_folio_zeromap() and zswap_load() return specific error codes:
>
> * 0 indicates the backend owns the swapped out content, which was
> successfully loaded into the page.
> * -ENOENT indicates the backend does not own the swapped out content.
> * -EINVAL and -EIO indicate the backend own the swapped out content, but
> the content was not successfully loaded into the page for some
> reasons. The folio will not be marked up-to-date, which will
> eventually cause the process requesting the page to SIGBUS (see the
> handling of not-up-to-date folio in do_swap_page() in mm/memory.c).
>
> zswap decompression failures on the zswap load path are treated as an
> -EIO error, as described above, and will no longer crash the kernel.
>
> As a side effect, we require one extra zswap tree traversal in the load
> and writeback paths. Quick benchmarking on a kernel build test shows no
> performance difference:
>
> With the new scheme:
> real: mean: 125.1s, stdev: 0.12s
> user: mean: 3265.23s, stdev: 9.62s
> sys: mean: 2156.41s, stdev: 13.98s
>
> The old scheme:
> real: mean: 125.78s, stdev: 0.45s
> user: mean: 3287.18s, stdev: 5.95s
> sys: mean: 2177.08s, stdev: 26.52s
>
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@xxxxxxxxxxxxxxxxxxxx/
>
> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Suggested-by: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
> include/linux/zswap.h | 4 +-
> mm/page_io.c | 35 ++++++++----
> mm/zswap.c | 130 ++++++++++++++++++++++++++++++------------
> 3 files changed, 120 insertions(+), 49 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index d961ead91bf1..9468cb3e0878 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -26,7 +26,7 @@ struct zswap_lruvec_state {
>
> unsigned long zswap_total_pages(void);
> bool zswap_store(struct folio *folio);
> -bool zswap_load(struct folio *folio);
> +int zswap_load(struct folio *folio);
> void zswap_invalidate(swp_entry_t swp);
> int zswap_swapon(int type, unsigned long nr_pages);
> void zswap_swapoff(int type);
> @@ -46,7 +46,7 @@ static inline bool zswap_store(struct folio *folio)
>
> static inline bool zswap_load(struct folio *folio)

int?

> {
> - return false;
> + return -ENOENT;
> }
>
> static inline void zswap_invalidate(swp_entry_t swp) {}
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 9b983de351f9..8a44faec3f92 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -511,7 +511,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
> mempool_free(sio, sio_pool);
> }
>
> -static bool swap_read_folio_zeromap(struct folio *folio)
> +/*
> + * Return: one of the following error codes:

nit:
"Returns 0 on success, or one of the following errors on failure:"

Also might as well make this a full kerneldoc?

> + *
> + * 0: the folio is zero-filled (and was populated as such and marked
> + * up-to-date and unlocked).
> + *
> + * -ENOENT: the folio was not zero-filled.
> + *
> + * -EINVAL: some of the subpages in the folio are zero-filled, but not all of
> + * them. This is an error because we don't currently support a large folio
> + * that is partially in the zeromap. The folio is unlocked, but NOT marked
> + * up-to-date, so that an IO error is emitted (e.g. do_swap_page() will
> + * sigbus).
> + */
> +static int swap_read_folio_zeromap(struct folio *folio)
> {
> int nr_pages = folio_nr_pages(folio);
> struct obj_cgroup *objcg;
> @@ -523,11 +537,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
> * that an IO error is emitted (e.g. do_swap_page() will sigbus).
> */

This comment says "Return true", so it needs to be updated. We probably
don't need to explain the return value anymore since it's documented
above.

> if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages,
> - &is_zeromap) != nr_pages))
> - return true;
> + &is_zeromap) != nr_pages)) {
> + folio_unlock(folio);
> + return -EINVAL;
> + }
>
> if (!is_zeromap)
> - return false;
> + return -ENOENT;
>
> objcg = get_obj_cgroup_from_folio(folio);
> count_vm_events(SWPIN_ZERO, nr_pages);
> @@ -538,7 +554,8 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>
> folio_zero_range(folio, 0, folio_size(folio));
> folio_mark_uptodate(folio);
> - return true;
> + folio_unlock(folio);
> + return 0;
> }
>
> static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> @@ -635,13 +652,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> }
> delayacct_swapin_start();
>
> - if (swap_read_folio_zeromap(folio)) {
> - folio_unlock(folio);
> + if (swap_read_folio_zeromap(folio) != -ENOENT)
> goto finish;

I would split the zeromap change into a separate patch, but it's
probably fine either way.

> - } else if (zswap_load(folio)) {
> - folio_unlock(folio);
> +
> + if (zswap_load(folio) != -ENOENT)
> goto finish;
> - }
>
> /* We have to read from slower devices. Increase zswap protection. */
> zswap_folio_swapin(folio);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6dbf31bd2218..b67481defc6c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> static u64 zswap_reject_compress_fail;
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
> +/* Load or writeback failed due to decompression failure */
> +static u64 zswap_decompress_fail;
> /* Store failed because underlying allocator could not get memory */
> static u64 zswap_reject_alloc_fail;
> /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -996,11 +998,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> return comp_ret == 0 && alloc_ret == 0;
> }
>
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> {
> struct zpool *zpool = entry->pool->zpool;
> struct scatterlist input, output;
> struct crypto_acomp_ctx *acomp_ctx;
> + int decomp_ret, dlen;
> u8 *src;
>
> acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> @@ -1025,12 +1028,31 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> sg_init_table(&output, 1);
> sg_set_folio(&output, folio, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> + decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> + dlen = acomp_ctx->req->dlen;
>
> if (src != acomp_ctx->buffer)
> zpool_unmap_handle(zpool, entry->handle);
> acomp_ctx_put_unlock(acomp_ctx);
> +
> + if (decomp_ret || dlen != PAGE_SIZE) {
> + zswap_decompress_fail++;
> + pr_alert_ratelimited(
> + "decompression failed with returned value %d on zswap entry with "

nit: Decompression*

I am also wondering how this looks like in dmesg? Is the line too long
to be read? Should we add some line breaks (e.g. like
warn_sysctl_write()), we could probably also put this in a helper to
keep this function visually easy to follow.

> + "swap entry value %08lx, swap type %d, and swap offset %lu. "
> + "compression algorithm is %s. compressed size is %u bytes, and "
> + "decompressed size is %u bytes.\n",
> + decomp_ret,
> + entry->swpentry.val,
> + swp_type(entry->swpentry),
> + swp_offset(entry->swpentry),
> + entry->pool->tfm_name,
> + entry->length,
> + acomp_ctx->req->dlen);

We should probably be using 'dlen' here since we're outside the lock.

> +
> + return false;
> + }
> + return true;
> }
>
> /*********************************
[..]
> @@ -1620,7 +1651,29 @@ bool zswap_store(struct folio *folio)
> return ret;
> }
>
> -bool zswap_load(struct folio *folio)
> +/**
> + * zswap_load() - load a page from zswap
> + * @folio: folio to load
> + *
> + * Return: one of the following error codes:

nit: Returns 0 on success, or ..

> + *
> + * 0: if the swapped out content was in zswap and was successfully loaded.
> + * The folio is unlocked and marked up-to-date.
> + *
> + * -EIO: if the swapped out content was in zswap, but could not be loaded
> + * into the page (for e.g, because there was a memory corruption, or a
> + * decompression bug). The folio is unlocked, but NOT marked up-to-date,
> + * so that an IO error is emitted (e.g. do_swap_page() will SIGBUS).
> + *
> + * -EINVAL: if the swapped out content was in zswap, but the page belongs
> + * to a large folio, which is not supported by zswap. The folio is unlocked,
> + * but NOT marked up-to-date, so that an IO error is emitted (e.g.
> + * do_swap_page() will SIGBUS).
> + *
> + * -ENOENT: if the swapped out content was not in zswap. The folio remains
> + * locked on return.
> + */
> +int zswap_load(struct folio *folio)
> {
> swp_entry_t swp = folio->swap;
> pgoff_t offset = swp_offset(swp);
[..]