Re: [PATCH] mm, swap: avoid leaving unused extend table after alloc race

From: Kairui Song

Date: Fri May 15 2026 - 04:32:58 EST


Forwarding sashiko review:

> > @@ -1451,11 +1453,27 @@ static int swap_extend_table_alloc(struct swap_info_struct *si,
> > return -ENOMEM;
> >
> > spin_lock(&ci->lock);
> > - if (!ci->extend_table)
> > - ci->extend_table = table;
> > - else
> > - kfree(table);
> > + /*
> > + * Extend table allocation requires releasing ci lock first so it's
> > + * possible that the slot has been freed, no longer overflowed, or
> > + * a concurrent extend table allocation has already succeeded, so
> > + * the allocation is no longer needed.
> > + */
> > + if (!cluster_table_is_alloced(ci))
> > + goto out_free;
> > + count = swp_tb_get_count(__swap_table_get(ci, ci_off));
> > + if (count < (SWP_TB_COUNT_MAX - 1))
> > + goto out_free;
>
> Is it possible for the extend table to still be leaked here due to a
> subsequent race?
> For example, if swap_retry_table_alloc() publishes the table because count is
> still SWP_TB_COUNT_MAX - 1:
> 1. Thread A in swap_retry_table_alloc() publishes the extend_table.
> 2. Before Thread A reacquires the page table lock and retries, Thread B
> unmaps the PTE.
> 3. Thread B calls swap_free(), decrementing the count from
> SWP_TB_COUNT_MAX - 1 to SWP_TB_COUNT_MAX - 2.
> 4. __swap_cluster_put_entry() only frees the extend table when the count
> decrements from SWP_TB_COUNT_MAX to SWP_TB_COUNT_MAX - 1, so the table
> remains attached.
> 5. Thread A resumes, sees pte_none, and skips the entry.
> Since the count never reached SWP_TB_COUNT_MAX, does the table remain
> permanently attached until the cluster is freed, triggering a WARN_ON_ONCE
> and a memory leak?

That seems possible indeed. I can adjust this a bit to avoid that
potential race too. The change is minor, so I will send V2 shortly.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 451d20bb9f47..365b4caeef4b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1536,13 +1536,21 @@ static void __swap_cluster_put_entry(struct swap_cluster_info *ci,
if (count == (SWP_TB_COUNT_MAX - 1)) {
ci->extend_table[ci_off] = 0;
__swap_table_set(ci, ci_off,
__swp_tb_mk_count(swp_tb, count));
- swap_extend_table_try_free(ci);
} else {
ci->extend_table[ci_off] = count;
}
} else {
__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb,
--count));
}
+
+ /*
+ * A count of `SWP_TB_COUNT_MAX - 1` triggers extend table allocation.
+ * If the count just dropped to that value, the extend table is no
+ * longer needed. If the count just dropped from that value, a pending
+ * dup may have just attached an extend table.
+ */
+ if (unlikely(count == SWP_TB_COUNT_MAX - 2 || count == SWP_TB_COUNT_MAX - 1))
+ swap_extend_table_try_free(ci);
}
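
To make the reported sequence easier to follow, here is a small userspace
sketch of the put path, not kernel code. Everything prefixed with toy_ is
hypothetical: the single-slot struct, the value 7 standing in for
SWP_TB_COUNT_MAX, and in particular toy_try_free(), which simply assumes
that the table is released once no count is saturated. The real
swap_extend_table_try_free() may check more than that.

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary stand-in for SWP_TB_COUNT_MAX; the real value is not modeled. */
#define TOY_COUNT_MAX 7u

/* Hypothetical one-slot "cluster": not the kernel's data layout. */
struct toy_slot {
	unsigned int count;	/* inline count, saturates at TOY_COUNT_MAX */
	unsigned int ext_count;	/* full count while the inline count is saturated */
	bool has_ext;		/* an extend table is attached */
};

/* Assumed behavior: release the table once no count is saturated. */
static void toy_try_free(struct toy_slot *s)
{
	if (s->has_ext && s->count < TOY_COUNT_MAX)
		s->has_ext = false;
}

/* Mirrors the shape of the put path, before (patched == false) or after the diff. */
static void toy_put(struct toy_slot *s, bool patched)
{
	unsigned int count = s->count;

	assert(count > 0);
	if (count == TOY_COUNT_MAX) {
		count = s->ext_count - 1;
		if (count == TOY_COUNT_MAX - 1) {
			/* Count fits inline again. */
			s->ext_count = 0;
			s->count = count;
			if (!patched)
				toy_try_free(s);
		} else {
			s->ext_count = count;
		}
	} else {
		s->count = --count;
	}

	/* New check from the diff: also covers a put from MAX - 1 to MAX - 2. */
	if (patched && (count == TOY_COUNT_MAX - 2 || count == TOY_COUNT_MAX - 1))
		toy_try_free(s);
}

/* Replays steps 1-5: table published at MAX - 1, one put, dup never retried. */
static bool toy_table_left_after_race(bool patched)
{
	struct toy_slot s = {
		.count = TOY_COUNT_MAX - 1, .ext_count = 0, .has_ext = true,
	};

	toy_put(&s, patched);
	return s.has_ext;
}

/* Control case: a saturated count drops back to MAX - 1. */
static bool toy_table_left_after_overflow_put(bool patched)
{
	struct toy_slot s = {
		.count = TOY_COUNT_MAX, .ext_count = TOY_COUNT_MAX, .has_ext = true,
	};

	toy_put(&s, patched);
	return s.has_ext;
}
```

In this model, toy_table_left_after_race(false) reports the table as still
attached, while toy_table_left_after_race(true) reports it as released. The
control case releases the table with or without the change.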