Re: [RFC PATCH 00/14] Virtual Swap Space
From: Kairui Song
Date: Tue Apr 08 2025 - 12:28:17 EST
On Tue, Apr 8, 2025 at 7:47 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to show
> people this prototype/RFC, including all the bells and whistles and a
> couple of actual use cases, so that folks can see what the end results
> will look like, and give me early feedback :)
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped-out content, as well as the index into swap data structures,
> such as the swap cache or the swap cgroup mapping. Tying a swap entry
> to its backing slot in this way is performant and efficient when swap
> is purely disk space and swapoff is rare.
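>
> As a rough, self-contained illustration of this coupling (the real
> helpers are swp_entry(), swp_type() and swp_offset() in
> include/linux/swapops.h; the bit layout below is simplified, not the
> kernel's actual one):
>
>   /*
>    * A swap entry directly encodes its backing device ("type") and
>    * the slot within that device ("offset"), so the value stored in
>    * the PTE *is* the physical location of the swapped-out page.
>    */
>   typedef struct { unsigned long val; } swp_entry_t;
>
>   #define SWP_TYPE_SHIFT 58 /* simplified placement of the type bits */
>
>   static inline swp_entry_t swp_entry(unsigned long type, unsigned long offset)
>   {
>           return (swp_entry_t){ .val = (type << SWP_TYPE_SHIFT) | offset };
>   }
>
>   static inline unsigned long swp_type(swp_entry_t entry)
>   {
>           return entry.val >> SWP_TYPE_SHIFT;
>   }
>
>   static inline unsigned long swp_offset(swp_entry_t entry)
>   {
>           return entry.val & ((1UL << SWP_TYPE_SHIFT) - 1);
>   }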
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
> mobile and embedded devices), users cannot adopt zswap and are forced
> to use zram. This is confusing for users, and it creates an extra
> burden for developers, who have to develop and maintain similar
> features for two separate swap backends (writeback, cgroup charging,
> THP support, etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and it
> caps the memory-saving potential of these optimizations at the static
> size of the swapfile, especially on high-memory systems that can have
> terabytes worth of memory. It also creates significant challenges for
> users who rely on swap utilization as an early OOM signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that swapoff
> requires a full page table walk to update all the page table entries
> that refer to the affected swap entries, as well as updates to all
> the associated swap data structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated
> per-swap-entry descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;                       /* the virtual swap slot */
>         union {                                  /* current backing state */
>                 swp_slot_t slot;                 /* slot on a physical swap device */
>                 struct folio *folio;             /* backed by an in-memory folio */
>                 struct zswap_entry *zswap_entry; /* compressed copy in zswap */
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;                     /* which backing is in use */
>
>         atomic_t memcgid;                        /* owning memcg (swap cgroup mapping) */
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;                     /* lifetime information */
> };
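>
> To make the resolution step concrete, a minimal sketch of the lookup
> path could look like the following (the function name and the enum
> value are illustrative, not taken from the patches):
>
>   static swp_slot_t vswap_to_slot(struct swp_desc *desc)
>   {
>           swp_slot_t slot = { 0 };
>
>           read_lock(&desc->lock);
>           /* only entries backed by a real device have a physical slot */
>           if (desc->type == VSWAP_SWAPFILE)
>                   slot = desc->slot;
>           read_unlock(&desc->lock);
>
>           return slot;
>   }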
Thanks for sharing the code. My initial idea after the discussion at
LSFMM was that there is a simple way to combine this with the "swap
table" [1] design of mine to solve the performance issue of this
series: just store the pointer to this struct in the swap table. It's
a brute-force, glue-like solution, but the contention issue would be
gone.
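Roughly, assuming the swap table is one atomic word per slot as in that
series (the helper names here are hypothetical):

  static struct swp_desc *swap_table_desc(atomic_long_t *table, pgoff_t off)
  {
          /* each table slot holds the descriptor pointer directly */
          return (struct swp_desc *)atomic_long_read(&table[off]);
  }

  static void swap_table_set_desc(atomic_long_t *table, pgoff_t off,
                                  struct swp_desc *desc)
  {
          atomic_long_set(&table[off], (long)desc);
  }

Lookups would then go through the per-cluster table instead of a global
structure, which is why the contention should disappear.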
Of course, that's not a good approach; ideally, the data structure can
be simplified down to an entry type in the swap table. The swap table
series handles locking and synchronization using either the cluster
lock (reusing the swap allocator and existing swap logic) or the folio
lock (much like the page cache). So many parts can be greatly
simplified; I think it will be at most ~32 bytes per page with a
virtual device (including the intermediate pointers). It will require
quite some work, though.
The good side of that approach is that we will have much lower memory
overhead and even better performance. The virtual space part will also
be optional: for a non-virtual setup, the memory consumption will be
only 8 bytes per page, dynamically allocated, as discussed at LSFMM.
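As a sketch of what that entry type could look like (the tag values and
helpers are hypothetical; pointers are word-aligned, so the low bits of
an 8-byte entry are free to carry a type tag):

  #define VSWAP_ENT_POINTER 0UL /* entry holds a struct swp_desc * */
  #define VSWAP_ENT_PACKED  1UL /* backing slot packed inline */

  static unsigned long vswap_ent_type(unsigned long ent)
  {
          return ent & 3UL;
  }

  static unsigned long vswap_pack_slot(unsigned long slot_val)
  {
          /* shift the slot up and tag it; the whole entry stays 8 bytes */
          return (slot_val << 2) | VSWAP_ENT_PACKED;
  }

In a non-virtual setup every entry can stay in the packed form, which
is where the 8-bytes-per-page figure above comes from.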
Sorry that I still have a few parts undone; I'm looking forward to
posting in about one week, e.g. after this weekend, if all goes well.
I'll also try to check your series first to see how the two efforts
can best be combined.
A draft version is available here, though, in case anyone is really
anxious to see the code. I wouldn't recommend spending much effort
checking it, as it may still change rapidly:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-unification
But the good news is that the total LOC should be reduced, or at least
won't increase much, as it will unify a lot of the swap
infrastructure. So things might be easier to implement after that.
[1] https://lore.kernel.org/linux-mm/CAMgjq7DHFYWhm+Z0C5tR2U2a-N_mtmgB4+idD2S+-1438u-wWw@xxxxxxxxxxxxxx/T/