Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
From: Johannes Weiner
Date: Fri Apr 11 2025 - 14:22:07 EST
On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> On 4/11/25 17:39, Johannes Weiner wrote:
> > On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> >> On 3/13/25 22:05, Johannes Weiner wrote:
> >> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >> >          if (!pageblock_aligned(cc->migrate_pfn))
> >> >                  return COMPACT_CONTINUE;
> >> > 
> >> > +        /*
> >> > +         * When defrag_mode is enabled, make kcompactd target
> >> > +         * watermarks in whole pageblocks. Because they can be stolen
> >> > +         * without polluting, no further fallback checks are needed.
> >> > +         */
> >> > +        if (defrag_mode && !cc->direct_compaction) {
> >> > +                if (__zone_watermark_ok(cc->zone, cc->order,
> >> > +                                        high_wmark_pages(cc->zone),
> >> > +                                        cc->highest_zoneidx, cc->alloc_flags,
> >> > +                                        zone_page_state(cc->zone,
> >> > +                                                        NR_FREE_PAGES_BLOCKS)))
> >> > +                        return COMPACT_SUCCESS;
> >> > +
> >> > +                return COMPACT_CONTINUE;
> >> > +        }
> >>
> >> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> >> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> >> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> >> put us below high wmark quickly and then we never satisfy this?
> >
> > The high watermark is not aligned, but why does it have to be? It's a
> > binary condition: met or not met. Compaction continues until it's met.
>
> What I mean is, kswapd will reclaim until the high watermark, which would be
> 32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
> of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
> there's concurrent freeing pushing it above the high wmark)
Ah, but kswapd also uses the (rounded up) NR_FREE_PAGES_BLOCKS check.
Buckle up...
> > Of course, similar to kswapd, it might not reach the watermarks and
> > keep running if there is a continuous stream of allocations consuming
> > the blocks it's making. Hence the ratio between wakeups & continues.
> >
> > But when demand stops, it'll balance the high mark and quit.
>
> Again, since kcompactd can only defragment free space and not create it, it
> may be trying in vain?
>
> [*] now when checking the code for the kswapd to kcompactd handover, I
> think I found another problem?
>
> we have:
> kswapd_try_to_sleep()
>   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>     pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>       - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>         we were only reclaiming until now and didn't wake up kcompactd and
>         this actually prevents the wake up?
Correct, so as per above, kswapd also does the NR_FREE_PAGES_BLOCKS
check. At first, at least. So it continues to produce adequate scratch
space and won't leave compaction high and dry on a watermark it cannot
meet. They are indeed coordinated in this aspect.
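
To make that concrete, the kswapd side of the series does roughly
this (paraphrased from memory, not a verbatim quote of the patch;
free_pages/balanced are my names):

        /* defrag_mode: balance whole pageblocks, not loose pages */
        if (defrag_mode)
                free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
        else
                free_pages = zone_page_state(zone, NR_FREE_PAGES);

        balanced = __zone_watermark_ok(zone, order, high_wmark_pages(zone),
                                       highest_zoneidx, 0, free_pages);

Same high watermark, but tested against whole free blocks, so reclaim
keeps producing raw material that compaction can assemble.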
As far as the *handoff* to kcompactd goes, I've been pulling my hair
out over this for a very long time. You're correct about the call
graph above. And actually, this is the case before defrag_mode too: if
you wake kswapd with, say, an order-8 request, it will do
pgdat_balanced() checks against that order, seemingly reclaim until
the request can succeed, *then* wake kcompactd and sleep. WTF?
But kswapd has this:
        /*
         * Fragmentation may mean that the system cannot be rebalanced for
         * high-order allocations. If twice the allocation size has been
         * reclaimed then recheck watermarks only at order-0 to prevent
         * excessive reclaim. Assume that a process requested a high-order
         * can direct reclaim/compact.
         */
        if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
                sc->order = 0;
Ignore the comment and just consider the code. What it does for higher
orders (whether defrag_mode is enabled or not) is reclaim a gap for
the order, ensure order-0 is met (which it most likely already is),
then enter the sleep path: wake kcompactd and wait for more work.
Effectively, as long as there are pending higher-order requests
looping in the allocator, kswapd does this:
1) reclaim a compaction gap delta
2) wake kcompactd
3) goto 1
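
Condensed into a (non-compiling) sketch of balance_pgdat() plus
kswapd_try_to_sleep(), with everything else elided:

        do {
                /* 1) reclaim toward the compaction gap */
                kswapd_shrink_node(pgdat, sc);

                /* gap made: from here on only balance order-0 */
                if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
                        sc->order = 0;
        } while (!pgdat_balanced(pgdat, sc->order, highest_zoneidx));

        /* 2) order-0 balanced, prepare_kswapd_sleep() succeeds */
        wakeup_kcompactd(pgdat, alloc_order, highest_zoneidx);

        /* 3) the next order-N wakeup from the allocator restarts this */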
This pipelining seems to work *very* well in practice, especially when
there is a large number of concurrent requests.
In the original huge allocator series, I tried to convert kswapd to
use compaction_suitable() to hand over sooner. However, this ran into
scaling issues with higher allocation concurrency: maintaining just a
single, static compaction gap when there could be hundreds of
allocation requests waiting for compaction results falls apart fast.
The current code has it right. The comments might be a bit dated and
the details could use some fine-tuning. But generally, as long as there
are incoming wakeups from the allocator, it makes sense to keep making
more space for compaction as well.
I think Mel was playing 4d chess with this stuff.
[ I kept direct reclaim/compaction out of this defrag_mode series, but
testing suggests the same is likely true for the direct path.
Direct reclaim bails out, via compaction_ready(), once a static
compaction gap exists for the requested order. But as soon as that
gap is there, you can get a thundering herd of direct compactors
storming it, most of which will then fail compaction_suitable().
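
For reference, the bailout in question is roughly this; I'm
paraphrasing compaction_ready() in mm/vmscan.c, not quoting it:

        /* reclaim built a gap over the high wmark: stop, go compact */
        watermark = high_wmark_pages(zone) + compact_gap(sc->order);
        if (zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx))
                return true;

One fixed target, no matter how many direct compactors are queued
up behind it.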
A pipeline of "reclaim gap delta, direct compact, retry" seems to
make more sense there as well. With adequate checks to prevent
excessive reclaim in corner cases of course... ]