Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks

From: Johannes Weiner
Date: Fri Apr 11 2025 - 11:41:58 EST


On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> > which helps, but produces only one block at a time. Allocation stalls
> > and THP failure rates are still higher than they could be.
> >
> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> > the watermarking for kswapd & kcompactd: instead of targeting the high
> > watermark in order-0 pages and checking for one suitable block, simply
> > require that the high watermark is entirely met in pageblocks.
>
> Hrm.

Hrm!

> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> > if (!pageblock_aligned(cc->migrate_pfn))
> > return COMPACT_CONTINUE;
> >
> > + /*
> > + * When defrag_mode is enabled, make kcompactd target
> > + * watermarks in whole pageblocks. Because they can be stolen
> > + * without polluting, no further fallback checks are needed.
> > + */
> > + if (defrag_mode && !cc->direct_compaction) {
> > + if (__zone_watermark_ok(cc->zone, cc->order,
> > + high_wmark_pages(cc->zone),
> > + cc->highest_zoneidx, cc->alloc_flags,
> > + zone_page_state(cc->zone,
> > + NR_FREE_PAGES_BLOCKS)))
> > + return COMPACT_SUCCESS;
> > +
> > + return COMPACT_CONTINUE;
> > + }
>
> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> put us below high wmark quickly and then we never satisfy this?

The high watermark is not aligned, but why does it have to be? It's a
binary condition: met or not met. Compaction continues until it's met.

NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
the effective target becomes align_up(high_wmark, pageblock_nr_pages),
as that's when NR_FREE_PAGES_BLOCKS first snaps above the (unaligned)
mark. But that seems reasonable, no?

The allocator side is using low/min, so we have the conventional
hysteresis between consumer and producer.

For illustration, on my 2G test box, the watermarks in DMA32 look like
this:

pages free 212057
boost 0
min 11164 (21.8 blocks)
low 13955 (27.3 blocks)
high 16746 (32.7 blocks)
promo 19537
spanned 456704
present 455680
managed 431617 (843.1 blocks)

So there are several blocks between the kblahds wakeup and sleep. The
first allocation to cut into a whole free block will decrease
NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
the remaining space won't change that counter. So the distance between
the watermarks doesn't fundamentally change (modulo block rounding).

> Doesn't then happen that with defrag_mode, in practice kcompactd basically
> always runs until scanners met?

Tracing kcompactd calls to compaction_finished() with defrag_mode:

@[COMPACT_CONTINUE]: 6955
@[COMPACT_COMPLETE]: 19
@[COMPACT_PARTIAL_SKIPPED]: 1
@[COMPACT_SUCCESS]: 17
@wakeuprequests: 3

Of course, similar to kswapd, it might not reach the watermarks and
keep running if there is a continuous stream of allocations consuming
the blocks it's making. Hence the ratio between wakeups & continues.

But when demand stops, it'll balance the high mark and quit.

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > * meet watermarks.
> > */
> > for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> > + unsigned long free_pages;
> > +
> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > mark = promo_wmark_pages(zone);
> > else
> > mark = high_wmark_pages(zone);
> > - if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>
> I think you just removed the only user of this _safe() function. Is the
> cpu-drift control it does no longer necessary?

Ah good catch. This should actually use zone_page_state_snapshot()
below depending on z->percpu_drift_mark.

This is active when there are enough CPUs for the vmstat pcp deltas to
exceed the low-min gap. Afaics this is still necessary, but it also
still requires a lot of CPUs to matter (>212 CPUs with 64G of memory).

I'll send a follow-up fix.

> > + /*
> > + * In defrag_mode, watermarks must be met in whole
> > + * blocks to avoid polluting allocator fallbacks.
> > + */
> > + if (defrag_mode)
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> > + else
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > + if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > + 0, free_pages))
> > return true;
> > }
> >