RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large folio in batches.
From: Sridhar, Kanchana P
Date: Mon Oct 13 2025 - 13:47:16 EST
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Friday, October 3, 2025 12:11 PM
> To: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> senozhatsky@xxxxxxxxxxxx; sj@xxxxxxxxxx; kasong@xxxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>; Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> large folio in batches.
>
>
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> > > Sent: Wednesday, October 1, 2025 9:19 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > > large folio in batches.
> [...]
> > >
> > > > + */
> > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > +							      unsigned int nr_entries,
> > > > +							      gfp_t gfp)
> > > > +{
> > > > +	return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > >
> > > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc() to
> > > allocate the entry on the same node as the compressed page. We use
> > > entry_to_nid() to get the node for LRU operations.
> > >
> > > This breaks that assumption.
> >
> > You bring up a good point. From reading the code in slub.c, my
> > understanding so far is that both bulk allocations and
> > kmem_cache_alloc_node() allocations are served from a per-CPU
> > "cpu_slab" that SLUB maintains.
> >
> > IIUC, the concern you are raising is that in the mainline, the entry is
> > allocated on the same node as the compressed page and is added to that
> > node's LRU list. IOW, the entry lands on the LRU of the node that the
> > compressed page belongs to.
> >
> > With this patch, kmem_cache_alloc_bulk() will allocate the entry from
> > the per-CPU slab of the CPU on which zswap_store() is called, and the
> > entry will be added to the LRU of that slab's NUMA node. The end result
> > is that a page's zswap_entry could end up on a different NUMA
> > node/memcg than the page itself.
> >
> > This is my thinking on how this will impact the zswap shrinker:
> >
> > 1) memcg shrinker: if the memcg the entry ends up in is on the
> >    zswap_list_lru, the entry will be written back.
> > 2) Global shrinker: it cycles through all memcgs that have pages on the
> >    zswap_list_lru, so the entry will be written back.
> >
> > Based on this, it is not clear to me that there is a problem; I would
> > like to request insights from you, Nhat and others as well.
> >
> > Interestingly, much of the code in slub.c checks
> > unlikely(!node_match(slab, node)). Does this imply some higher-level
> > mm slab allocation requirement?
> >
> > I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages"
> > entries if we think that would be more correct.
>
> I wanted to share some updates and summarize my understanding from
> looking further at slub.c. The "zswap_entry_cache" kmem_cache has slab
> memory created for each node. The main change with v12 is:
>
> before:
> The page being compressed and its zswap_entry are on the same NUMA
> node, and thus implicitly on the same per-node slab.
>
> v12:
> The page being compressed and its zswap_entry could end up on different
> NUMA nodes (and hence on different per-node memcg LRU lists) if the
> process that owns the page has been migrated to a different node.
>
> Impact on the zswap LRU list vis-à-vis the memcg shrinker:
>
> before:
> The memcg [hierarchy] on the page's original NUMA node would need to
> face memory pressure for the page's zswap_entry to be written back.
> Hence the node on which the process is currently running may not see
> the benefit of the memory being freed up.
>
> v12:
> The memcg on the node where the process was running when the page was
> compressed would have to face memory pressure for the page's
> zswap_entry to be written back, and that node will see the benefit of
> the memory freed by writeback. If the process has since migrated to a
> different node than the one on which it was running when the page was
> compressed, we have the same issue as "before".
>
> Is my understanding correct? Please let me know if I am missing something.
>
> The bulk allocation does improve batching performance, especially with 64K
> folios:
>
> kernel_compile, 64K folios, deflate-iaa:
> ========================================
>            mm-unstable-9-18-2025        v12   without bulk alloc/free
> ---------------------------------------------------------------------
> real_sec                  874.74     821.59                    870.82
> sys_sec                 3,834.35   3,791.12                  3,794.06
> ---------------------------------------------------------------------
>
> kernel_compile, 64K folios, zstd:
> =================================
>            mm-unstable-9-18-2025        v12   without bulk alloc/free
> ---------------------------------------------------------------------
> real_sec                  925.08     853.14                    899.40
> sys_sec                 5,318.65   5,172.23                  5,415.20
> ---------------------------------------------------------------------
>
> kernel_compile, PMD folios, deflate-iaa:
> ========================================
>            mm-unstable-9-18-2025        v12   without bulk alloc/free
> ---------------------------------------------------------------------
> real_sec                  808.10     794.85                    809.33
> sys_sec                 4,351.01   4,266.95                  4,169.07
> ---------------------------------------------------------------------
>
> kernel_compile, PMD folios, zstd:
> =================================
>            mm-unstable-9-18-2025        v12   without bulk alloc/free
> ---------------------------------------------------------------------
> real_sec                  848.06     845.42                    836.06
> sys_sec                 5,898.58   5,741.31                  5,795.75
> ---------------------------------------------------------------------
>
> Based on this, and if my understanding of the v12 change's impact on
> the zswap shrinker is correct, I am not sure the changed assumption is
> necessarily a concern.
>
> >
> > >
> > > > +}
> > > > +
> > > > +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> > > > +							      unsigned int nr_entries)
> > > > +{
> > > > +	kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> > > > +}
> [...]
> > > > +static bool zswap_store_pages(struct folio *folio,
> > > > + long start,
> > > > + long end,
> > > > + struct obj_cgroup *objcg,
> > > > + struct zswap_pool *pool,
> > > > + int node_id,
> > > > + bool folio_wb)
> > > > {
> > > > - swp_entry_t page_swpentry = page_swap_entry(page);
> > > > - struct zswap_entry *entry, *old;
> > > > -
> > > > - /* allocate entry */
> > > > - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > > > - if (!entry) {
> > > > - zswap_reject_kmemcache_fail++;
> > > > - return false;
> > > > + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > > > + u8 i, store_fail_idx = 0, nr_pages = end - start;
> > > > +
> > > > + VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> > > > +
> > > > +	if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> > > > +						      nr_pages, GFP_KERNEL))) {
> > > > +		for (i = 0; i < nr_pages; ++i) {
> > > > +			entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > > > +
> > > > + if (unlikely(!entries[i])) {
> > > > + zswap_reject_kmemcache_fail++;
> > > > + /*
> > > > + * While handling this error, we only need to
> > > > + * call zswap_entries_cache_free_batch() for
> > > > + * entries[0 .. i-1].
> > > > + */
> > > > + nr_pages = i;
> > > > + goto store_pages_failed;
> > > > + }
> > >
> > > Is it okay to use kmem_cache_free_bulk() to free slab objects that were
> > > not allocated with kmem_cache_alloc_bulk()?
> >
> > I recall verifying that it should be Ok, but can check again.
> >
>
> I verified the code again, and yes, it is Ok for slab objects allocated
> by either kmem_cache_alloc_bulk() or kmem_cache_alloc_node() to be
> freed with kmem_cache_free_bulk(). kmem_cache_free_bulk()
> opportunistically builds freelists from the supplied objects and then
> "bulk transfers" them to the slabs' freelists, with less
> synchronization overhead than repeated kmem_cache_free() calls.
>
> I verified this with usemem30 with the following code change in
> zswap_store_pages(), and saw no issues with deflate-iaa/zstd:
>
> 1) for each of nr_pages: kmem_cache_alloc_node()
> 2) kmem_cache_free_bulk()
> 3) kmem_cache_alloc_bulk()
> 4) kmem_cache_free_bulk()
> 5) kmem_cache_alloc_bulk()
> 6) [proceed]
>
> Yosry, please let me know how I should proceed and if there are
> other experiments that I can run to verify this change to bulk alloc/free
> to address any concerns.
Hi Yosry,

Just wanted to check in to get your suggestions on next steps as I start
working on v13.

Thanks for taking the time to provide code review comments!

Best regards,
Kanchana
>
> Thanks,
> Kanchana