Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends

From: Nhat Pham

Date: Wed Jun 24 2026 - 19:08:28 EST


On Mon, Jun 22, 2026 at 5:16 PM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> On Fri, Jun 12, 2026 at 12:37:33PM -0700, Nhat Pham wrote:
> > Build the virtual swap layer on top of the swap-table infrastructure.
> > Virtual swap entries decouple PTE swap entries from physical backing,
> > allowing pages to be compressed by zswap (or detected as zero-filled)
> > without pre-allocating a physical swap slot.
> >
> > This patch only supports zswap and zero-page backends. If zswap_store
> > fails, the page stays dirty in the swap cache (AOP_WRITEPAGE_ACTIVATE)
> > - physical disk backing fallback comes in the next patch. Zswap
> > writeback of vswap-backed entries is also disabled - the shrinker
> > skips when no physical swap pages are available.
> >
> > Suggested-by: Kairui Song <kasong@xxxxxxxxxxx>
> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> [..]
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 993406074d58..466f8a182716 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -38,6 +38,7 @@
> > #include <linux/zsmalloc.h>
> >
> > #include "swap.h"
> > +#include "vswap.h"
> > #include "internal.h"
> >
> > /*********************************
> > @@ -762,7 +763,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> > * Carries out the common pattern of freeing an entry's zsmalloc allocation,
> > * freeing the entry itself, and decrementing the number of stored pages.
> > */
> > -static void zswap_entry_free(struct zswap_entry *entry)
> > +void zswap_entry_free(struct zswap_entry *entry)
> > {
> > zswap_lru_del(&zswap_list_lru, entry);
> > zs_free(entry->pool->zs_pool, entry->handle);
> > @@ -994,16 +995,21 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > struct swap_info_struct *si;
> > int ret = 0;
> >
> > + /* try to allocate swap cache folio */
> > si = get_swap_device(swpentry);
> > if (!si)
> > return -EEXIST;
> >
> > + /*
> > + * Vswap entries have no physical backing - writeback would fail
> > + * and SIGBUS the caller. Bail before we waste a swap-cache folio
> > + * allocation.
> > + */
>
> Seems like this comment belongs in the previous patch, and the other
> comment movement is undoing what last patch did.

Yeah this comment belongs to the first patch. I added it after the
fact but commit to the second patch.

TBH, the first patch kinda not do much. It just declares a new special
struct swap_info_struct, with some helpers and checks, but it's not
hooked to any allocation path. Logically it should be squashed into
this patch, but this patch is already 600 LoC, lol.

>
> > if (si->flags & SWP_VSWAP) {
> > put_swap_device(si);
> > return -EINVAL;
> > }
> >
> > - /* try to allocate swap cache folio */
> > mpol = get_task_policy(current);
> > folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
> > NO_INTERLEAVE_INDEX);
> > @@ -1416,25 +1422,25 @@ static bool zswap_store_page(struct page *page,
> > if (!zswap_compress(page, entry, pool))
> > goto compress_failed;
> >
> > - old = xa_store(swap_zswap_tree(page_swpentry),
> > - swp_offset(page_swpentry),
> > - entry, GFP_KERNEL);
> > - if (xa_is_err(old)) {
> > - int err = xa_err(old);
> > + if (is_vswap_entry(page_swpentry)) {
> > + vswap_zswap_store(page_swpentry, entry);
> > + } else {
> > + old = xa_store(swap_zswap_tree(page_swpentry),
> > + swp_offset(page_swpentry),
> > + entry, GFP_KERNEL);
> > + if (xa_is_err(old)) {
> > + int err = xa_err(old);
> > +
> > + WARN_ONCE(err != -ENOMEM,
> > + "unexpected xarray error: %d\n", err);
> > + zswap_reject_alloc_fail++;
> > + goto store_failed;
> > + }
> >
> > - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > - zswap_reject_alloc_fail++;
> > - goto store_failed;
> > + if (old)
> > + zswap_entry_free(old);
> > }
> >
> > - /*
> > - * We may have had an existing entry that became stale when
> > - * the folio was redirtied and now the new version is being
> > - * swapped out. Get rid of the old.
> > - */
> > - if (old)
> > - zswap_entry_free(old);
> > -
> > /*
> > * The entry is successfully compressed and stored in the tree, there is
> > * no further possibility of failure. Grab refs to the pool and objcg,
> > @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
> > struct mem_cgroup *memcg = NULL;
> > struct zswap_pool *pool;
> > bool ret = false;
> > + bool partial_store = false;
> > long index;
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1524,8 +1531,10 @@ bool zswap_store(struct folio *folio)
> > for (index = 0; index < nr_pages; ++index) {
> > struct page *page = folio_page(folio, index);
> >
> > - if (!zswap_store_page(page, objcg, pool))
> > + if (!zswap_store_page(page, objcg, pool)) {
> > + partial_store = index > 0;
> > goto put_pool;
> > + }
> > }
> >
> > if (objcg)
> > @@ -1548,7 +1557,9 @@ bool zswap_store(struct folio *folio)
> > * offsets corresponding to each page of the folio. Otherwise,
> > * writeback could overwrite the new data in the swapfile.
> > */
> > - if (!ret) {
> > + if (partial_store && is_vswap_entry(swp))
> > + folio_release_vswap_backing(folio);
>
> Hmm the above should also only happen in the !ret case, but that's not
> obvious from the code here. I think all of this should go under if
> (!ret), but maybe reverse the polarity to avoid the indentation?

Yeah that's just me avoiding indentation lol. But yes, it only happens
in !ret case:

>
> if (ret)
> return ret;
>
> if (is_vswap_entry(swp)) {
> if (partial_store)
> folio_release_vswap_backing(folio);
> return ret;
> }
>
> ...
>
> Alternatively you can move the check_old code for xarray into a helper
> and do:
>
> if (!ret) {
> if (is_vswap_entry(swp)) {
> if (partial_store)
> folio_release_vswap_backing(folio);
> } else {
> zswap_free_old_xa_entries(swp, nr_pages)
> }
> }

Yup! I can switch to this if you think it's cleaner.

>
> Also, I think you can probably drop partial_store and check the index
> directly here.

Ah yeah. That's true!

>
> > + else if (!ret && !is_vswap_entry(swp)) {
> > unsigned type = swp_type(swp);
> > pgoff_t offset = swp_offset(swp);
> > struct zswap_entry *entry;
> > @@ -1588,8 +1599,7 @@ bool zswap_store(struct folio *folio)
> > int zswap_load(struct folio *folio)
> > {
> > swp_entry_t swp = folio->swap;
> > - pgoff_t offset = swp_offset(swp);
> > - struct xarray *tree = swap_zswap_tree(swp);
> > + struct swap_info_struct *si = __swap_entry_to_info(swp);
> > struct zswap_entry *entry;
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1599,16 +1609,25 @@ int zswap_load(struct folio *folio)
> > return -ENOENT;
> >
> > /*
> > - * Large folios should not be swapped in while zswap is being used, as
> > - * they are not properly handled. Zswap does not properly load large
> > - * folios, and a large folio may only be partially in zswap.
> > + * zswap_load() does not support large folios. For non-vswap
> > + * entries this is unexpected on the swapin path: WARN and
> > + * sigbus. For vswap entries __swap_cache_add_check() has already
> > + * filtered out ZSWAP-backed THPs under the cluster lock, so the
> > + * large folio here is zero- or phys-backed; return -ENOENT to
> > + * fall through to the phys/zero IO path.
>
> Hmm should we start simple and avoid THP swapin for vswap initially?
>
> IIUC, it isn't really vswap specific. Even without vswap, it's possible
> that an entire folio is on-disk, not in zswap, in which case THP swap
> should be allowed.
>
> I assume it's not common for zswap to be enabled and an entire THP worth
> of pages are not in zswap, so maybe we can add this later?

I was thinking of removing it altogether haha. Are we even doing THP
swap in for non-sync IO devices?

if (!folio) {
/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
[...]
else
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);

So I guess it's primarily zram that does THP swap in here? on
non-SWP_SYNCHRONOUS_IO devices, seems like we only do "THP swapin" if
we catch the page in swap cache (minor page fault). :) Will zram users
like vswap?

OTOH, zswap might be getting THP zswap-in support soon, so it's not
just zram backend that cares about these kinds of check? :)

Or maybe I can keep it, but separate it from this big patch to make it
easier to review :) Lemme play with it.

>
> > */
> > - if (WARN_ON_ONCE(folio_test_large(folio))) {
> > - folio_unlock(folio);
> > - return -EINVAL;
> > + if (folio_test_large(folio)) {
> > + if (WARN_ON_ONCE(!swap_is_vswap(si))) {
> > + folio_unlock(folio);
> > + return -EINVAL;
> > + }
> > + return -ENOENT;
> > }
> >
> > - entry = xa_load(tree, offset);
> > + if (swap_is_vswap(si))
> > + entry = vswap_zswap_load(swp);
> > + else
> > + entry = xa_load(swap_zswap_tree(swp), swp_offset(swp));
> > if (!entry)
> > return -ENOENT;
> >
> > @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
> > if (entry->objcg)
> > count_objcg_events(entry->objcg, ZSWPIN, 1);
> >
> > - /*
> > - * We are reading into the swapcache, invalidate zswap entry.
> > - * The swapcache is the authoritative owner of the page and
> > - * its mappings, and the pressure that results from having two
> > - * in-memory copies outweighs any benefits of caching the
> > - * compression work.
> > - */
> > folio_mark_dirty(folio);
> > - xa_erase(tree, offset);
> > - zswap_entry_free(entry);
> > +
> > + if (swap_is_vswap(si)) {
> > + folio_release_vswap_backing(folio);
>
> Is there any advantage to calling folio_release_vswap_backing() over
> zswap_entry_free()? Seems like __vswap_release_backing() ends up just
> calling zswap_entry_free() -- and I don't see any vswap-specific state
> being cleaned up.
>
> I wonder if the zswap code should call zswap_entry_free() directly? Same
> goes for the call in zswap_store() above.

Most just not repeating the vtable lookup-and-lock and what not. :)
The pattern is repeated the third time in swapoff when I allow phys
swap to be the backend of vswap in the next patch so I figure probably
should add some helper.

>
> > + } else {
> > + xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> > + zswap_entry_free(entry);
> > + }
> >
> > folio_unlock(folio);
> > return 0;
> > --
> > 2.53.0-Meta
> >