Re: [RFC PATCH] mm: Avoiding split large folios if swap has no space
From: Barry Song
Date: Fri Jun 26 2026 - 02:19:19 EST
On Thu, Jun 25, 2026 at 9:46 PM David Hildenbrand (Arm)
<david@xxxxxxxxxx> wrote:
>
> On 6/25/26 15:36, Johannes Weiner wrote:
> > On Thu, Jun 25, 2026 at 09:49:56AM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> I don't quite understand you. get_nr_swap_pages() returns
> >>> nr_swap_pages, which increases or decreases as swap is allocated or
> >>> freed. I guess it just reflects how many swaps we currently have
> >>> available?
> >>
> >> Indeed, I was confused by the function name; it's "free swap pages". So all good :)
> >>
> >>>
> >>>
> >>> Yep. The tricky part is that mem_cgroup_try_charge_swap() cannot
> >>> return how much swap quota is available in the memcg. Do you prefer to
> >>> add an output argument to mem_cgroup_try_charge_swap() to expose
> >>> that?
> >> That would probably be cleanest, if that is easily possible. We would want to
> >> get memcg maintainer feedback on that.
> >>
> >> @memcg folks: we'd like to know whether splitting a large folio would make
> >> mem_cgroup_try_charge_swap() succeed on a split (smaller) part, to distinguish
> >> "there is no way we can swap out anything, don't split" vs. "we could swap out,
> >> split".
> >
> > It's technically doable, but is this worth the bother? The remaining
> > headroom is less than a large folio. You can split this one, but you
> > cannot even swap out all of its subpages anymore?
>
> I was asking myself the same, but when we think in terms of THPs on arm64 64k
> we're in the range of double-digit MiBs.

Yep. But we haven't enabled THP_SWAP for 64KB base pages yet; there
is an ongoing attempt[1].

The current concern is that such large folios may incur significant
latency for I/O or compression, and reclaim could end up waiting
for a long time.

For huge folios, maybe we should split them into 2MB folios
instead of splitting all the way down to base pages when swapping
out. That way, we could allow architectures with "huge" large folios
to swap out using THP_SWAP?

BTW, PowerPC is enabling THP_SWPOUT for up to 16MB THP[2].

[1] https://lore.kernel.org/linux-mm/20251226063759.4020782-2-tongweilin@xxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/cover.1781843449.git.ritesh.list@xxxxxxxxx/
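
To make the 2MB idea concrete, here is a minimal userspace sketch (not
kernel code) of how a target split order could be picked. The names
swapout_split_order(), swapout_nr_pieces() and SWAPOUT_MAX_SHIFT are
hypothetical; only the 2MB cap comes from the suggestion above.

```c
#include <assert.h>

/* Hypothetical cap on the folio size handed to swap-out: 2MB (2^21 bytes). */
#define SWAPOUT_MAX_SHIFT 21

/*
 * Return the order a folio would be split to before swap-out.
 * page_shift is log2 of the base page size (12 for 4KB, 16 for 64KB).
 * Folios at or below the cap keep their order; larger ones are split
 * down to the cap instead of all the way to order 0.
 */
static inline unsigned int swapout_split_order(unsigned int folio_order,
					       unsigned int page_shift)
{
	unsigned int max_order;

	/* The sketch assumes the base page is no larger than the cap. */
	assert(page_shift <= SWAPOUT_MAX_SHIFT);
	max_order = SWAPOUT_MAX_SHIFT - page_shift;

	return folio_order < max_order ? folio_order : max_order;
}

/* Number of pieces the folio ends up in after such a split. */
static inline unsigned long swapout_nr_pieces(unsigned int folio_order,
					      unsigned int page_shift)
{
	return 1UL << (folio_order -
		       swapout_split_order(folio_order, page_shift));
}
```

With 64KB base pages, a 16MB folio (order 8) would be split into eight
order-5 (2MB) folios, while a 2MB folio on 4KB base pages (order 9) would
be left whole.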
>
> > From the cgroup
> > side, we don't need the limit to be obeyed this rigidly. We overcharge
> > temporarily in other places if it's convenient to do so. A fuzz factor
> > around the limit is acceptable.
>
> Thanks for that information.
>
> >
> > But if you still want to do it, here is how:
> >
> > The page_counter_try_charge() in __mem_cgroup_try_charge_swap() walks
> > the hierarchy upwards. If it fails, it will store the first level that
> > failed against its limit. You can do the mem_cgroup_margin() math
> > against this counter to determine headroom. An ancestor *could* be
> > more restrictive, so you need to finish the hierarchy walk to the root
> > and use the min() of all the swap.max - page_counter_read(swap). Then
> > return that in a return argument from __mem_cgroup_try_charge_swap().
>

It seems it is exactly what mem_cgroup_get_nr_swap_pages() is doing:

long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
	long nr_swap_pages = get_nr_swap_pages();

	if (mem_cgroup_disabled() || do_memsw_account())
		return nr_swap_pages;
	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
		nr_swap_pages = min_t(long, nr_swap_pages,
				      READ_ONCE(memcg->swap.max) -
				      page_counter_read(&memcg->swap));
	return nr_swap_pages;
}

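
As a rough illustration of why the walk has to continue past the first
level, here is a small userspace model of the same min() computation.
struct toy_memcg and toy_get_nr_swap_pages() are made-up stand-ins for
this sketch, not kernel types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup: only the swap counter and limit. */
struct toy_memcg {
	long swap_usage;		/* models page_counter_read(&memcg->swap) */
	long swap_max;			/* models memcg->swap.max */
	struct toy_memcg *parent;	/* NULL for the root */
};

/*
 * Walk from memcg up to (but not including) the root and return the
 * smallest headroom seen, capped by the globally free swap pages.
 */
static inline long toy_get_nr_swap_pages(const struct toy_memcg *memcg,
					 long global_free)
{
	long nr = global_free;

	assert(memcg != NULL);

	for (; memcg->parent; memcg = memcg->parent) {
		long room = memcg->swap_max - memcg->swap_usage;

		if (room < nr)
			nr = room;
	}
	return nr;
}
```

If a child has 502 pages of headroom but its parent only 10, the model
returns 10: the ancestor is the more restrictive level, which is the case
Johannes describes.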
> Thanks! @Barry, up to you if we want to implement that right away or if we're
> simply going to assume that if charging fails, not worth splitting (changing the
> existing handling IIUC).

Thanks very much to Johannes and David for the help.

Hi Johannes, I wonder if you're comfortable with something like
the code below (just as a concept):

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..be4f86271c0d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -573,12 +573,12 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 #endif
 #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
-int __mem_cgroup_try_charge_swap(struct folio *folio);
-static inline int mem_cgroup_try_charge_swap(struct folio *folio)
+int __mem_cgroup_try_charge_swap(struct folio *folio, long *left_space);
+static inline int mem_cgroup_try_charge_swap(struct folio *folio, long *left_space)
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	return __mem_cgroup_try_charge_swap(folio);
+	return __mem_cgroup_try_charge_swap(folio, left_space);
 }
 extern void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..c8c9a10befad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5578,7 +5578,7 @@ int __init mem_cgroup_init(void)
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
-int __mem_cgroup_try_charge_swap(struct folio *folio)
+int __mem_cgroup_try_charge_swap(struct folio *folio, long *left_space)
 {
 	unsigned int nr_pages = folio_nr_pages(folio);
 	struct swap_cluster_info *ci;
@@ -5611,6 +5611,10 @@ int __mem_cgroup_try_charge_swap(struct folio *folio)
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
 		mem_cgroup_private_id_put(memcg, nr_pages);
+		if (folio_test_large(folio))
+			*left_space = mem_cgroup_get_nr_swap_pages(memcg);
+		else
+			*left_space = 0;
 		return -ENOMEM;
 	}
 	mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
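
On the caller side, the reported headroom could then drive the "split" vs.
"don't split" decision David described. The following is only a userspace
sketch of that decision; enum swap_fail_action and swap_charge_decide()
are hypothetical names and not part of the patch. Note that the concept
code writes through left_space unconditionally on failure, so every caller
would have to pass a valid pointer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for a swap-out attempt on one folio. */
enum swap_fail_action {
	SWAP_PROCEED,		/* charge succeeded, keep the folio whole */
	SWAP_SPLIT_AND_RETRY,	/* charge failed, but a smaller piece may fit */
	SWAP_GIVE_UP,		/* charge failed and nothing fits, don't split */
};

/*
 * charge_ret: return value of the modified charge call, 0 or -ENOMEM.
 * left_space: headroom in pages reported on failure.
 * nr_pages:   size in pages of the folio that was charged.
 */
static inline enum swap_fail_action
swap_charge_decide(int charge_ret, long left_space, long nr_pages)
{
	assert(nr_pages >= 1);

	if (charge_ret == 0)
		return SWAP_PROCEED;
	/* Splitting only helps a large folio when at least one page still fits. */
	if (nr_pages > 1 && left_space > 0)
		return SWAP_SPLIT_AND_RETRY;
	return SWAP_GIVE_UP;
}
```

A failed charge on a 16-page folio with 4 pages of headroom would yield
SWAP_SPLIT_AND_RETRY; the same failure with no headroom yields
SWAP_GIVE_UP and the folio stays intact.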
Thanks
Barry