Re: [RFC PATCH 04/14] mm: swap: swap cache support for virtualized swap
From: Nhat Pham
Date: Tue Apr 08 2025 - 11:52:13 EST
On Tue, Apr 8, 2025 at 8:34 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, Apr 8, 2025 at 8:00 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Mon, Apr 07, 2025 at 04:42:05PM -0700, Nhat Pham wrote:
> > > Currently, the swap cache code assumes that the swap space is of a fixed
> > > size. The virtual swap space is dynamically sized, so the existing
> > > partitioning code cannot be easily reused. A dynamic partitioning is
> > > planned, but for now keep the design simple and just use a flat
> > > swapcache for vswap.
> > >
> > > Since the vswap's implementation has begun to diverge from the old
> > > implementation, we also introduce a new build config
> > > (CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the
> > > old implementation, with no behavioral change.
> > >
> > > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > > ---
> > > mm/Kconfig | 13 ++++++++++
> > > mm/swap.h | 22 ++++++++++------
> > > mm/swap_state.c | 68 +++++++++++++++++++++++++++++++++++++++++--------
> > > 3 files changed, 85 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 1b501db06417..1a6acdb64333 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -22,6 +22,19 @@ menuconfig SWAP
> > > used to provide more virtual memory than the actual RAM present
> > > in your computer. If unsure say Y.
> > >
> > > +config VIRTUAL_SWAP
> > > + bool "Swap space virtualization"
> > > + depends on SWAP
> > > + default n
> > > + help
> > > + When this is selected, the kernel is built with the new swap
> > > + design. This will allow us to decouple the swap backends
> > > + (zswap, on-disk swapfile, etc.), and save disk space when we
> > > + use zswap (or the zero-filled swap page optimization).
> > > +
> > > + There might be more lock contentions with heavy swap use, since
> > > + the swap cache is no longer range partitioned.
> > > +
> > > config ZSWAP
> > > bool "Compressed cache for swap pages"
> > > depends on SWAP
> > > diff --git a/mm/swap.h b/mm/swap.h
> > > index d5f8effa8015..06e20b1d79c4 100644
> > > --- a/mm/swap.h
> > > +++ b/mm/swap.h
> > > @@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio);
> > > int swap_writepage(struct page *page, struct writeback_control *wbc);
> > > void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
> > >
> > > -/* linux/mm/swap_state.c */
> > > -/* One swap address space for each 64M swap space */
> > > +/* Return the swap device position of the swap slot. */
> > > +static inline loff_t swap_slot_pos(swp_slot_t slot)
> > > +{
> > > + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
> > > +}
> >
> > In the same vein as the previous email, please avoid mixing moves,
> > renames and new code as much as possible. This makes it quite hard to
> > follow what's going on.
> >
> > I think it would be better if you structure the series as follows:
> >
> > 1. Prep patches. Separate patches for moves, renames, new code.
> >
> > 3. mm: vswap
> > - config VIRTUAL_SWAP
> > - mm/vswap.c with skeleton data structures, init/exit, Makefile hookup
> >
> > 4. (temporarily) flatten existing address spaces
> >
> > IMO you can do the swapcache and zswap in one patch
> >
> > 5+. conversion patches
> >
> > Grow mm/vswap.c as you add discrete components like the descriptor
> > allocator, swapoff locking, the swap_cgroup tracker etc.
> >
> > You're mostly doing this part already. But try to order them by
> > complexity and on a "core to periphery" gradient. I.e. swapoff
> > locking should probably come before cgroup stuff.
> >
> > Insert move and rename patches at points where they make the most
> > sense. I.e. if they can be understood in the current upstream code
> > already, put them with step 1 prep patches. If you find a move or a
> > rename can only be understood in the context of one of the components,
> > put them in a prep patch right before that one.
>
> Makes sense, yeah! I'll try to avoid mixing moves/renames/new code as
> much as I can.
>
> >
> > > @@ -260,6 +269,28 @@ void delete_from_swap_cache(struct folio *folio)
> > > folio_ref_sub(folio, folio_nr_pages(folio));
> > > }
> > >
> > > +#ifdef CONFIG_VIRTUAL_SWAP
> > > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > > + unsigned long end)
> > > +{
> > > + swp_slot_t slot = swp_slot(type, begin);
> > > + swp_entry_t entry = swp_slot_to_swp_entry(slot);
> > > + unsigned long index = swap_cache_index(entry);
> > > + struct address_space *address_space = swap_address_space(entry);
> > > + void *old;
> > > + XA_STATE(xas, &address_space->i_pages, index);
> > > +
> > > + xas_set_update(&xas, workingset_update_node);
> > > +
> > > + xa_lock_irq(&address_space->i_pages);
> > > + xas_for_each(&xas, old, entry.val + end - begin) {
> > > + if (!xa_is_value(old))
> > > + continue;
> > > + xas_store(&xas, NULL);
> > > + }
> > > + xa_unlock_irq(&address_space->i_pages);
> >
> > I don't think you need separate functions for this, init, exit etc. if
> > you tweak the macros to resolve to one tree. The current code already
> > works if swapfiles are smaller than SWAP_ADDRESS_SPACE_PAGES and there
> > is only one tree, after all.
>
> For clear_shadow_from_swap_cache(), I think I understand what you want
> - keep clear_shadow_from_swap_cache() the same for two
> implementations, but at caller sites, have the callers themselves
> determine the range in swap cache (i.e. (begin, end)).
>
> I'm a bit confused with init and exit, but I assume there is a way to
> do it for them as well.
>
> I will note though, that it might increase the number of ifdef
> sections (or alternatively, IS_ENABLED() checks), because these
> functions are called in different contexts for the two
> implementations:
>
> 1. init and exit are called in swapon/swapoff in the old
> implementation. They are called in swap initialization in the virtual
> swap implementation.
>
> 2. Similarly, we clear swap cache shadows when we free physical swap
> slots in the old implementation, and when we free virtual swap slots
> in the new implementation.
>
> I think it is good actually, because it makes the difference explicit
> rather than implicit. Also, it helps us know exactly which code block
> to target when we unify the two implementations :) Just putting it out
> there.

Actually, I think I was confused.

At this stage, we have no real difference in the implementations yet -
it's purely single tree vs multiple trees. So you're right - we
shouldn't even need two implementations of the code...

I'll fix this.
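
To make the "single tree vs multiple trees" point concrete, here is a rough
userspace sketch (not kernel code) of how the lookup helpers alone could
absorb the difference. Everything prefixed with model_/MODEL_ is made up for
illustration, MODEL_VIRTUAL_SWAP stands in for CONFIG_VIRTUAL_SWAP, and the
shift of 14 is assumed from the "one address space per 64M" comment with 4K
pages. The real helpers would be swap_address_space() and swap_cache_index().

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_VIRTUAL_SWAP; 1 selects the flat tree. */
#ifndef MODEL_VIRTUAL_SWAP
#define MODEL_VIRTUAL_SWAP 1
#endif

/* Assumed: 2^14 pages per tree, i.e. 64M of swap with 4K pages. */
#define MODEL_AS_SHIFT 14
#define MODEL_AS_MASK  ((1UL << MODEL_AS_SHIFT) - 1)

/* Which tree an offset lands in. The flat variant always picks tree 0. */
static inline unsigned long model_tree_nr(unsigned long off)
{
#if MODEL_VIRTUAL_SWAP
	(void)off;
	return 0;
#else
	return off >> MODEL_AS_SHIFT;
#endif
}

/* Index within that tree. The flat variant uses the offset unchanged. */
static inline unsigned long model_tree_index(unsigned long off)
{
#if MODEL_VIRTUAL_SWAP
	return off;
#else
	return off & MODEL_AS_MASK;
#endif
}

/*
 * Shared caller with no ifdef of its own: number of trees that the
 * inclusive offset range [begin, end] touches.
 */
static inline unsigned long model_trees_spanned(unsigned long begin,
						unsigned long end)
{
	assert(begin <= end);
	return model_tree_nr(end) - model_tree_nr(begin) + 1;
}
```

With the flat variant every range spans exactly one tree, so a range walk
like the one in clear_shadow_from_swap_cache() would not need a second copy;
only the two helpers differ between the builds.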
>
> >
> > This would save a lot of duplication and keep ifdefs more confined.