[RFC PATCH 32/45] mm: page_alloc: proactive high-water trigger for SPB slab shrink

From: Rik van Riel

Date: Thu Apr 30 2026 - 16:34:51 EST


From: Rik van Riel <riel@xxxxxxxx>

The SPB slab shrinker introduced earlier in the series only fires when
__rmqueue_smallest falls all the way through to Pass 3 (about to taint
a clean SPB) or when __rmqueue_claim is about to taint one. Bare-metal
testing on a 247 GB devvm with btrfs root (rev 398, with Pass 2c) shows
this is too late: at boot+16min only 15 shrinks had fired in 6 minutes
while slab grew from 1.7 GB to 11.7 GB and tainted Normal-zone SPBs
climbed from 4 baseline to 16. The 100ms throttle (max 10 shrinks/sec
per pgdat) further capped the response rate, and the trigger placement
meant slab growth could keep being absorbed by already-tainted SPBs
without ever firing the shrinker until those SPBs were exhausted, at
which point the only remaining option is to taint a fresh clean SPB.

Two changes:

1. Add a proactive high-water trigger on the success paths of
__rmqueue_smallest's tainted-SPB passes (Pass 1 SB_TAINTED, Pass 2,
Pass 2b, Pass 2c). When a non-movable allocation consumes from a
tainted SPB whose nr_free_pages has fallen below spb_tainted_reserve
worth of pages (reserve_pageblocks * pageblock_nr_pages), queue a
slab shrink. The predicate compares total free pages rather than
whole free pageblocks (nr_free): sub-pageblock allocations and
fragmented free space don't move the pageblock count but do consume
the SPB's freeable capacity, and we can't assume slab reclaim will
produce whole pageblocks either. This makes the trigger frequency
proportional to the rate of non-movable consumption from contended
tainted SPBs, instead of firing only at the cliff edge.

2. Remove the 100ms time-based throttle from queue_spb_slab_shrink().
The throttle was redundant with queue_work()'s built-in single-flight
semantics (returns false if the work is already queued/running) and
was actively harmful: the throttle stamp was bumped before calling
queue_work(), so even a call that queue_work() rejected suppressed
further triggers for the next 100ms. With the new high-water trigger
firing per allocation, the natural rate-limiter is the worker's
runtime. The now-unused spb_slab_shrink_last field is removed from
pglist_data.
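
For illustration only (not part of the patch), the arithmetic behind the
high-water predicate in change 1 can be sketched in user space. The
model_* names are hypothetical stand-ins, and the values chosen for
SPB_TAINTED_RESERVE_MIN and pageblock_nr_pages are assumptions picked to
make the numbers easy to follow; only the shape of the comparison is
taken from spb_tainted_reserve() and spb_below_shrink_high_water().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for the arithmetic only; the kernel defines the real ones. */
#define MODEL_RESERVE_MIN        4U     /* stand-in for SPB_TAINTED_RESERVE_MIN */
#define MODEL_PAGEBLOCK_NR_PAGES 512UL  /* stand-in for pageblock_nr_pages */

/* Hypothetical stand-in for struct superpageblock: only the fields used here. */
struct model_spb {
	uint16_t total_pageblocks;
	unsigned long nr_free_pages;
};

/* Mirrors spb_tainted_reserve(): max(minimum, total_pageblocks / 32). */
static uint16_t model_tainted_reserve(const struct model_spb *sb)
{
	uint16_t scaled = sb->total_pageblocks / 32;

	return scaled > MODEL_RESERVE_MIN ? scaled : (uint16_t)MODEL_RESERVE_MIN;
}

/* Mirrors spb_below_shrink_high_water(): compare pages, not whole pageblocks. */
static bool model_below_high_water(const struct model_spb *sb)
{
	return sb->nr_free_pages <
	       (unsigned long)model_tainted_reserve(sb) * MODEL_PAGEBLOCK_NR_PAGES;
}

/* Worked cases under the assumed constants above. */
void model_high_water_demo(void)
{
	/* 512 pageblocks: reserve = 512 / 32 = 16 pageblocks = 8192 pages. */
	struct model_spb big = { .total_pageblocks = 512, .nr_free_pages = 8192 };

	assert(!model_below_high_water(&big));	/* exactly at the threshold */
	big.nr_free_pages = 8191;
	assert(model_below_high_water(&big));	/* one page under: trigger */

	/* 64 pageblocks: 64 / 32 = 2, raised to the minimum of 4 = 2048 pages. */
	struct model_spb small = { .total_pageblocks = 64, .nr_free_pages = 2047 };

	assert(model_below_high_water(&small));
}
```

Because the comparison is in pages, a single sub-pageblock allocation can
cross the threshold even when the count of whole free pageblocks has not
changed.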

queue_work() absorbs the resulting per-alloc burst at near-zero cost
(test-and-set on WORK_STRUCT_PENDING_BIT) when a pass is already in
flight, so unconditional firing on every qualifying allocation is
cheap.
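
A minimal user-space sketch of why the burst is cheap, assuming the
pending bit behaves like a plain test-and-set flag. This models only that
flag and the SPB_SLAB_SHRINK_QUEUED count; struct model_pgdat and the
model_* helpers are hypothetical, the counter is not thread-safe, and the
point at which the real workqueue clears the bit is deliberately not
modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-pgdat state; not the kernel's layout. */
struct model_pgdat {
	atomic_bool shrink_pending;	/* models the work item's pending bit */
	unsigned long shrink_queued;	/* models the SPB_SLAB_SHRINK_QUEUED count */
};

/* Only the caller that flips pending from false to true counts as queued. */
static void model_queue_shrink(struct model_pgdat *p)
{
	if (!atomic_exchange(&p->shrink_pending, true))
		p->shrink_queued++;
}

/* Clears the flag so a later trigger can queue again (timing not modelled). */
static void model_clear_pending(struct model_pgdat *p)
{
	atomic_store(&p->shrink_pending, false);
}

/* A burst of triggers collapses into one queued pass per pending window. */
void model_single_flight_demo(void)
{
	struct model_pgdat p = { .shrink_pending = false, .shrink_queued = 0 };

	for (int i = 0; i < 1000; i++)
		model_queue_shrink(&p);
	assert(p.shrink_queued == 1);

	model_clear_pending(&p);
	model_queue_shrink(&p);
	assert(p.shrink_queued == 2);
}
```

In this model every trigger after the first costs one atomic exchange and
nothing else, which is the property the unconditional firing relies on.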

Pass 4 (movable falling back to tainted) does not get the trigger:
movable consumption does not contribute to the slab pressure that taints
fresh SPBs, and Pass 4 already filters out SPBs at or below reserve.
Clean-SPB success paths in Pass 1 are also untouched (clean SPBs are
not the source of the pressure).

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 7 +++---
mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++----------
2 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acaff292140f..68892e40cd4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1573,12 +1573,11 @@ typedef struct pglist_data {

/*
* SPB-driven slab reclaim: single work item per pgdat (shrink_slab
- * is node-scoped, so one work in-flight per node is the max), with
- * a 100ms throttle. queue_work() gives us single-flight semantics
- * for free.
+ * is node-scoped, so one work in-flight per node is the max).
+ * queue_work() gives us single-flight semantics for free — fresh
+ * triggers no-op while a pass is in progress.
*/
struct work_struct spb_slab_shrink_work;
- unsigned long spb_slab_shrink_last;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2db3dd86a84..ff7755ef2b79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2692,6 +2692,23 @@ static inline u16 spb_tainted_reserve(const struct superpageblock *sb)
return max_t(u16, SPB_TAINTED_RESERVE_MIN, sb->total_pageblocks / 32);
}

+/*
+ * High-water threshold for proactively kicking the slab shrinker. When a
+ * non-movable allocation consumes from a tainted SPB whose total free
+ * pages have fallen below spb_tainted_reserve worth of pages, queue a
+ * shrink so we start freeing slab memory before the SPB is exhausted.
+ *
+ * Compared against nr_free_pages rather than nr_free (whole pageblocks):
+ * sub-pageblock allocations and fragmented free space don't move the
+ * pageblock count, but they do consume the SPB's freeable capacity, and
+ * we can't assume slab reclaim will produce whole pageblocks either.
+ */
+static inline bool spb_below_shrink_high_water(const struct superpageblock *sb)
+{
+ return sb->nr_free_pages <
+ (unsigned long)spb_tainted_reserve(sb) * pageblock_nr_pages;
+}
+
/*
* On systems with many superpageblocks, we can afford to "write off"
* tainted superpageblocks by aggressively packing unmovable/reclaimable
@@ -2877,6 +2894,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2896,6 +2916,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2941,6 +2964,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = claim_whole_block(zone, page,
current_order, order,
migratetype, MIGRATE_MOVABLE);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2978,6 +3003,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
0, true);
if (!page)
continue;
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3061,6 +3088,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
opposite_mt);
__spb_set_has_type(page,
migratetype);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -9126,9 +9155,9 @@ static void queue_spb_evacuate(struct zone *zone, unsigned int order,
* tainted SPB is to shrink the slab caches whose pages live there.
*
* shrink_slab() is node-scoped, so one work item per pgdat is enough:
- * a single embedded work_struct, gated by a 100ms throttle.
- * queue_work() returns false if the work is already queued/running, so
- * we get single-flight for free.
+ * a single embedded work_struct. queue_work() returns false if the work
+ * is already queued/running, so we get single-flight for free — fresh
+ * triggers no-op until the in-flight pass completes.
*
* shrink_slab() itself is location-agnostic — it walks all registered
* shrinkers and frees objects whose backing pages may live in any
@@ -9189,10 +9218,11 @@ static void spb_slab_shrink_work_fn(struct work_struct *work)
* queue_spb_slab_shrink - schedule deferred slab shrink for SPB pressure
* @zone: zone whose tainted-SPB pool is running low
*
- * Throttled to one enqueue per 100ms per pgdat. queue_work() handles
- * single-flight: if the work is already queued or running, it returns
- * false and the throttle stamp still gets bumped (next call will be
- * no-op until the throttle elapses).
+ * Single-flight via queue_work(): if the work is already queued or
+ * running, it returns false and we no-op. There is no time-based
+ * throttle — the rate at which fresh shrink runs can fire is bounded
+ * by how fast the worker completes (one full pass freeing up to
+ * SPB_SLAB_SHRINK_TARGET_OBJS objects).
*
* Callable from any context: page allocator paths hold zone->lock,
* the SPB evacuate worker does not. queue_work() takes only the
@@ -9212,10 +9242,6 @@ static void queue_spb_slab_shrink(struct zone *zone)
if (!pgdat->evacuate_wq)
return;

- if (time_before(jiffies, pgdat->spb_slab_shrink_last + HZ / 10))
- return;
-
- pgdat->spb_slab_shrink_last = jiffies;
if (queue_work(pgdat->evacuate_wq, &pgdat->spb_slab_shrink_work))
count_vm_event(SPB_SLAB_SHRINK_QUEUED);
}
--
2.52.0