[RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems

From: Nhat Pham
Date: Sat Mar 29 2025 - 07:02:44 EST


Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out,
creating memory pressure on the high tier.

This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If the task reclaiming memory follows
the local-first policy, for example, the memory requested for zswap can
be served by the upper tier, leading to the aforementioned inversion.

This RFC fixes this inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for
hosts with CXL, where the memory for the compressed object is requested
preferentially from the same node that the original page resides on.
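
To make the difference concrete, here is a userspace toy model (not the
patch itself). All names (toy_node, toy_alloc_preferred, the two
toy_zswap_store_* helpers) and the two-node layout are hypothetical. The
only difference between the two policies is which node id is passed as
the preference: the reclaiming task's node, or the node of the page
being compressed. In the kernel, the latter would presumably come from
something like page_to_nid(), but that is an assumption about the
implementation.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical layout: node 0 = DRAM, node 1 = CXL */

/* Toy per-node free page counter; not a kernel data structure. */
struct toy_node {
	size_t free_pages;
};

/*
 * Take one page from preferred_nid if it has free memory, otherwise
 * fall back to any other node. Returns the node id used, or -1 if
 * every node is exhausted.
 */
static int toy_alloc_preferred(struct toy_node nodes[NR_NODES],
			       int preferred_nid)
{
	assert(preferred_nid >= 0 && preferred_nid < NR_NODES);

	if (nodes[preferred_nid].free_pages > 0) {
		nodes[preferred_nid].free_pages--;
		return preferred_nid;
	}
	for (int nid = 0; nid < NR_NODES; nid++) {
		if (nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return nid;
		}
	}
	return -1;
}

/* Local-first: the compressed copy lands near the reclaiming task. */
static int toy_zswap_store_local_first(struct toy_node nodes[NR_NODES],
				       int task_nid, int page_nid)
{
	(void)page_nid;	/* the page's own node is ignored */
	return toy_alloc_preferred(nodes, task_nid);
}

/* Same-node mode: the compressed copy prefers the page's own node. */
static int toy_zswap_store_same_node(struct toy_node nodes[NR_NODES],
				     int task_nid, int page_nid)
{
	(void)task_nid;	/* the reclaimer's node is ignored */
	return toy_alloc_preferred(nodes, page_nid);
}
```

With a reclaimer running on node 0 and a cold page on node 1, the
local-first helper returns node 0 (the inversion), while the same-node
helper returns node 1 as long as node 1 still has free memory.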

With the new zswap allocation mode enabled, we should observe the
following dynamics:

1. When demotion is turned on, under reasonable conditions, zswap will
prefer CXL memory by default, since top-tier memory being reclaimed
will typically be demoted instead of swapped.

2. This should prevent reclaim on the lower tier from causing high-tier
memory pressure due to new allocations.

3. This should avoid a quiet promotion of cold memory (memory being
zswapped is cold, but is promoted when put into the zswap pool
because the memory allocated for the compressed copy comes from the
high tier).

4. However, this may cause pressure on the CXL tier, which may in turn
result in further demotion (to swap, etc). This needs to be
tested.

I'm still testing and collecting more data, but figured I should send
this out as an RFC to spark discussion:

1. Is this the right policy? Do we need a more complicated policy?
Should we instead go for the "lowest" node (which would require a new
memory tiering API)? Or maybe try each node from the current node
down to the lowest node in the hierarchy?

Also, I hacked this fix together with CXL in mind, but if there are
other cases that I should also address, we can explore a more general
memory allocation strategy or interface.

2. Similarly, is this the right zsmalloc API? For instance, we could
build a full-fledged mempolicy-based API for zsmalloc, but I haven't
found a use case for it yet.

3. Assuming this is the right policy, what should be the semantics? Not
very good at naming things, so same_node_mode might not be it :)
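
For reference, the alternative raised in question 1 (try each node from
the current one down to the lowest) could look roughly like the toy
model below. This is a userspace sketch only: toy_tier and
toy_alloc_downward are hypothetical names, and the assumption that tiers
are indexed from highest (0) to lowest is made purely for illustration.
Real NUMA node ids carry no such ordering, which is why this variant
would likely need help from the memory tiering code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: index 0 is the top tier, higher indices are lower tiers. */
struct toy_tier {
	size_t free_pages;
};

/*
 * Walk from the tier holding the original page down to the lowest
 * tier and take a page from the first one with free memory. Never
 * climbs back to a higher tier. Returns the tier index used, or -1
 * if nothing at or below 'start' has free memory.
 */
static int toy_alloc_downward(struct toy_tier *tiers, int nr_tiers,
			      int start)
{
	assert(tiers != NULL);
	assert(start >= 0 && start < nr_tiers);

	for (int t = start; t < nr_tiers; t++) {
		if (tiers[t].free_pages > 0) {
			tiers[t].free_pages--;
			return t;
		}
	}
	return -1;
}
```

Unlike a "preferred node with fallback" scheme, this variant fails
rather than allocating from a higher tier, so the caller would need its
own fallback when the walk returns -1.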

Nhat Pham (2):
zsmalloc: let callers select NUMA node to store the compressed objects
zswap: add sysfs knob for same node mode

Documentation/admin-guide/mm/zswap.rst | 9 +++++++++
include/linux/zpool.h | 4 ++--
mm/zpool.c | 8 +++++---
mm/zsmalloc.c | 28 +++++++++++++++++++-------
mm/zswap.c | 10 +++++++--
5 files changed, 45 insertions(+), 14 deletions(-)


base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58
--
2.47.1