Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
From: Johannes Weiner
Date: Wed Apr 02 2025 - 15:58:01 EST
On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> any policy for the allocation of memory for the compressed data,
> instead just adopting the memory policy of the task entering reclaim,
> or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
>
> 1. Systems with CXL-based memory tiering can encounter the following
> inversion with zswap: the coldest pages demoted to the CXL tier
> can return to the high tier when they are zswapped out, creating
> memory pressure on the high tier.
>
> 2. Consider a direct reclaimer scanning nodes in order of allocation
> preference. If it ventures into remote nodes, the memory it
> compresses there should stay there. Trying to shift those contents
> over to the reclaiming thread's preferred node further *increases*
> its local pressure, provoking more spills. The remote node is
> also the most likely to refault this data again. This undesirable
> behavior was pointed out by Johannes Weiner in [1].
>
> 3. For zswap writeback, the zswap entries are organized in
> node-specific LRUs, based on the node placement of the original
> pages, allowing for targeted zswap writeback for specific nodes.
>
> However, the compressed data of a zswap entry can be placed on a
> different node than the one its LRU belongs to. This means that reclaim
> targeted at one node might not free up memory used for zswap entries
> in that node, but instead free memory in a different node.
>
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap pass the node of the original page to zsmalloc, and having
> zsmalloc prefer the specified node if we need to allocate new (zs)pages
> for the compressed data.
>
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
>
> 1. On a system with demotion enabled, we will generally prefer
> demotions over zswapping, and only zswap when pages have
> already gone to the lowest tier. This patch should achieve the
> desired effect for the most part.
>
> 2. If the preferred node is out of memory, letting the compressed data
> go to other nodes can be better than the alternatives (OOMs,
> keeping cold memory unreclaimed, disk swapping, etc.).
>
> 3. If the allocation goes to a separate node because we have a zspage
> with slots available, at least we're not creating extra immediate
> memory pressure (since the space is already allocated).
>
> 4. While there can be some mixing, we generally reclaim pages in
> same-node batches, which encourages zspage grouping that is more
> likely to go to the right node.
>
> 5. A strict binding would require partitioning zsmalloc by node, which
> is more complicated, and more prone to regression, since it reduces
> the storage density of zsmalloc. We need to evaluate the tradeoff
> and benchmark carefully before adopting such an involved solution.
>
> This patch does not fix zram, leaving its memory allocation behavior
> unchanged. We leave this effort to future work.
zram's zs_malloc() calls all have page context. It seems a lot easier
to just fix the bug for them as well than to have two allocation APIs
and verbose commentary?
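For illustration only (hypothetical call site, not part of the quoted patch):
zram_write_page() already holds the page being compressed, so the node could
be forwarded the same way zswap does it, assuming zs_malloc() simply grew a
nid argument. Roughly:

	/*
	 * Hypothetical zram call site: the page being written is in hand,
	 * so its node can be passed straight through to zsmalloc. The gfp
	 * flags here are placeholders, not zram's exact flags.
	 */
	handle = zs_malloc(zram->mem_pool, comp_len,
			   GFP_NOIO | __GFP_NOWARN,
			   page_to_nid(page));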
> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int *nid)
> {
> - struct page *page = alloc_page(gfp);
> + struct page *page;
> +
> + if (nid)
> + page = alloc_pages_node(*nid, gfp, 0);
> + else {
> + /*
> + * XXX: this is the zram path. We should consider fixing zram to also
> + * use alloc_pages_node() and prefer the same node as the original page.
> + *
> + * Note that alloc_pages_node(NUMA_NO_NODE, gfp, 0) is not equivalent
> + * to alloc_page(gfp). The former will prefer the local/closest node,
> + * whereas the latter will try to follow the memory policy of the current
> + * process.
> + */
> + page = alloc_page(gfp);
> + }
>
> return page_zpdesc(page);
> }
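If the zram callers were converted to pass a node as well, the NULL-pointer
special case and the XXX block above could go away entirely. A minimal sketch,
assuming every caller passes either a real node or NUMA_NO_NODE:

	static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, int nid)
	{
		/* NUMA_NO_NODE falls back to preferring the local/closest node. */
		struct page *page = alloc_pages_node(nid, gfp, 0);

		return page_zpdesc(page);
	}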
> @@ -461,10 +476,13 @@ static void zs_zpool_destroy(void *pool)
> zs_destroy_pool(pool);
> }
>
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> + gfp_t gfp, const int *nid);
> +
> static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> - unsigned long *handle)
> + unsigned long *handle, const int nid)
> {
> - *handle = zs_malloc(pool, size, gfp);
> + *handle = zs_malloc_node(pool, size, gfp, &nid);
>
> if (IS_ERR_VALUE(*handle))
> return PTR_ERR((void *)*handle);
> return 0;
> }
>
>
> -/**
> - * zs_malloc - Allocate block of given size from pool.
> - * @pool: pool to allocate from
> - * @size: size of block to allocate
> - * @gfp: gfp flags when allocating object
> - *
> - * On success, handle to the allocated object is returned,
> - * otherwise an ERR_PTR().
> - * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> - */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> + gfp_t gfp, const int *nid)
> {
> unsigned long handle;
> struct size_class *class;
> @@ -1397,6 +1406,21 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>
> return handle;
> }
> +
> +/**
> + * zs_malloc - Allocate block of given size from pool.
> + * @pool: pool to allocate from
> + * @size: size of block to allocate
> + * @gfp: gfp flags when allocating object
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise an ERR_PTR().
> + * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> + */
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +{
> + return zs_malloc_node(pool, size, gfp, NULL);
> +}
> EXPORT_SYMBOL_GPL(zs_malloc);
>
> static void obj_free(int class_size, unsigned long obj)
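To make the comparison concrete: with zram converted too, the exported entry
point could just take the node directly, avoiding the zs_malloc() /
zs_malloc_node() split, the forward declaration in the zpool glue, and the
NULL-pointer convention. A sketch of one possible signature (hypothetical,
with NUMA_NO_NODE still available for any caller that genuinely has no page
context):

	/* Hypothetical single entry point in include/linux/zsmalloc.h */
	unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
				int nid);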