Re: [PATCH] mm: let kswapd work again for node that used to be hopeless but may not now

From: Byungchul Park
Date: Thu May 23 2024 - 20:47:59 EST


On Thu, May 23, 2024 at 01:53:37PM +0100, Karim Manaouil wrote:
> On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote:
> > I suffered from kswapd stopped in the following scenario:
> >
> > CONFIG_NUMA_BALANCING enabled
> > sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > numa node0 (500GB local DRAM, 128 CPUs)
> > numa node1 (100GB CXL memory, no CPUs)
> > swap off
> >
> > 1) Run any workload using a lot of anon pages e.g. mmap(200GB).
> > 2) Keep adding another workload using a lot of anon pages.
> > 3) The DRAM becomes filled with only anon pages through promotion.
> > 4) Demotion barely works due to severe memory pressure.
> > 5) kswapd for node0 stops because of the unreclaimable anon pages.
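
For reference, steps 1) and 2) can be reproduced with anon hoggers as
simple as the sketch below; the 200GB size and the page-touch stride
here are arbitrary choices of mine, not taken from the original runs:

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 200ULL << 30;      /* ~200GB of anon memory */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            /* Fault in every page so the anon pages are populated. */
            for (size_t off = 0; off < len; off += 4096)
                    p[off] = 1;

            pause();        /* stay resident until killed, as in 6) */
            return 0;
    }

Starting one instance and then adding more, as in 2), fills DRAM with
anon pages that can no longer be demoted once the CXL node is full too.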
>
> It's not very clear to me, but if I understand correctly, if you have

I don't have free memory on CXL.

> free memory on CXL, kswapd0 should not stop as long as demotion is

kswapd0 stops because demotion barely works.

> successfully migrating the pages from DRAM to CXL and returning that as
> nr_reclaimed in shrink_folio_list()?
>
> If that's the case, kswapd0 is making progress and shouldn't give up.

It's not the case.

> If CXL memory is also filled and migration fails, then it doesn't make
> sense to me to wake up kswapd0 as it obviously won't help with anything,

It's true *only* when it won't help with anything.

However, kswapd should work again once the system gets back to normal,
e.g. after the anon hoggers are terminated. That is the issue I
addressed.
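
To be concrete about why kswapd stays stopped: once ->kswapd_failures
reaches MAX_RECLAIM_RETRIES, both prepare_kswapd_sleep() and
wakeup_kswapd() treat the node as hopeless, and only a reclaim run that
makes progress resets the counter, so nothing revives kswapd after the
hoggers are gone. Roughly, condensed from the context lines quoted
below (a simplified sketch, not verbatim kernel code):

    /* wakeup_kswapd(): a hopeless node is never woken again */
    if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
            return;

    /* shrink_node(): the only way back is to make progress */
    if (reclaimable)
            pgdat->kswapd_failures = 0;

With the patch, pages newly allocated during the failure streak bump
->nr_may_reclaimable, and a wakeup for a given order is allowed again
once at least 1 << order such pages exist, e.g. one page for an
order-0 wakeup, 512 for an order-9 one.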

> because, you guessed it, you have no memory in the first place!!
>
> > 6) Manually kill the memory hoggers.

This is the point.

Byungchul

> > 7) kswapd is still stopped even though the system got back to normal.
> >
> > From then on, the system has to run without the background reclaim
> > service provided by kswapd, until a successful direct reclaim revives
> > it. Even worse, the memory tiering mechanism can no longer work,
> > because it relies on kswapd, which has stopped.
> >
> > However, after 6), the DRAM will be filled with pages that might or
> > might not be reclaimable, depending on how they are going to be used.
> > Since they are at least potentially reclaimable, it's worth
> > optimistically letting kswapd work again when needed.
> >
> > Signed-off-by: Byungchul Park <byungchul@xxxxxx>
> > ---
> >  include/linux/mmzone.h |  4 ++++
> >  mm/page_alloc.c        | 12 ++++++++++++
> >  mm/vmscan.c            | 21 ++++++++++++++++-----
> >  3 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81ef..7c0ba90ea7b4 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1331,6 +1331,10 @@ typedef struct pglist_data {
> >          enum zone_type kswapd_highest_zoneidx;
> >
> >          int kswapd_failures;        /* Number of 'reclaimed == 0' runs */
> > +        int nr_may_reclaimable;     /* Number of pages that have been
> > +                                       allocated since the node was
> > +                                       considered hopeless due to too
> > +                                       many kswapd_failures. */
> >
> >  #ifdef CONFIG_COMPACTION
> >          int kcompactd_max_order;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d336..1dd2daede014 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >                                                          unsigned int alloc_flags)
> >  {
> > +        pg_data_t *pgdat = page_pgdat(page);
> > +
> >          post_alloc_hook(page, order, gfp_flags);
> >
> > +        /*
> > +         * New pages might or might not be reclaimable depending on
> > +         * how they are going to be used. However, since they are
> > +         * potentially reclaimable, it's worth optimistically trying
> > +         * reclaim by letting kswapd work again even after too many
> > +         * ->kswapd_failures, once ->nr_may_reclaimable is big enough.
> > +         */
> > +        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +                pgdat->nr_may_reclaimable += 1 << order;
> > +
> >          if (order && (gfp_flags & __GFP_COMP))
> >                  prep_compound_page(page, order);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3ef654addd44..5b39090c4ef1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> >  done:
> >          /* kswapd should never fail */
> >          pgdat->kswapd_failures = 0;
> > +        pgdat->nr_may_reclaimable = 0;
> >  }
> >
> >  /******************************************************************************
> > @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >           * sleep. On reclaim progress, reset the failure counter. A
> >           * successful direct reclaim run will revive a dormant kswapd.
> >           */
> > -        if (reclaimable)
> > +        if (reclaimable) {
> >                  pgdat->kswapd_failures = 0;
> > -        else if (sc->cache_trim_mode)
> > +                pgdat->nr_may_reclaimable = 0;
> > +        } else if (sc->cache_trim_mode)
> >                  sc->cache_trim_mode_failed = 1;
> >  }
> >
> > @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
> >          clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
> >  }
> >
> > +static bool may_reclaimable(pg_data_t *pgdat, int order)
> > +{
> > +        return pgdat->nr_may_reclaimable >= (1 << order);
> > +}
> > +
> >  /*
> >   * Prepare kswapd for sleeping. This verifies that there are no processes
> >   * waiting in throttle_direct_reclaim() and that watermarks have been met.
> > @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
> >                  wake_up_all(&pgdat->pfmemalloc_wait);
> >
> >          /* Hopeless node, leave it to direct reclaim */
> > -        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +            !may_reclaimable(pgdat, order))
> >                  return true;
> >
> >          if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
> > @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >                  goto restart;
> >          }
> >
> > -        if (!sc.nr_reclaimed)
> > +        if (!sc.nr_reclaimed) {
> >                  pgdat->kswapd_failures++;
> > +                pgdat->nr_may_reclaimable = 0;
> > +        }
> >
> >  out:
> >          clear_reclaim_active(pgdat, highest_zoneidx);
> > @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> >                  return;
> >
> >          /* Hopeless node, leave it to direct reclaim if possible */
> > -        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
> > +        if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +             !may_reclaimable(pgdat, order)) ||
> >              (pgdat_balanced(pgdat, order, highest_zoneidx) &&
> >               !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
> >                  /*
> > --
> > 2.17.1
> >
> >