Re: Reclaim regression after 1c30844d2dfe

From: Ivan Babrou
Date: Wed Feb 12 2020 - 17:45:55 EST


Here's a typical graph: https://imgur.com/a/n03x5yH

* Green (numa0) and blue (numa1) for 4.19
* Yellow (numa0) and orange (numa1) for 5.4

These downward slopes on numa0 on 5.4 are fairly typical of the
worst-case scenario.

If I try to clean up data a bit from a bunch of machines, this is how
numa0 compares to numa1 with 1h average values of free memory above
5GiB:

* https://imgur.com/a/6T4rRzi

I think it's safe to say that numa0 is much worse, but I cannot
be 100% sure that numa1 is free from adverse effects; they may just be
hiding in the noise caused by rolling reboots.
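
For anyone who wants to collect the same per-node numbers, here is a
minimal sketch. It assumes the "Node <n> MemFree: <kB> kB" line format
of /sys/devices/system/node/node<n>/meminfo; the helper names are made
up for illustration and are not what produced the graphs above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one "Node <n> MemFree: <kB> kB" line.
 * Returns 1 and fills *node / *kb on a match, 0 otherwise.
 * Outputs are left untouched when the line does not match.
 */
int parse_node_memfree(const char *line, int *node, unsigned long long *kb)
{
	int n;
	unsigned long long v;

	assert(line && node && kb);
	if (sscanf(line, "Node %d MemFree: %llu kB", &n, &v) != 2)
		return 0;
	*node = n;
	*kb = v;
	return 1;
}

/* Hypothetical filter: is this sample above the 5GiB cut-off? */
int above_5gib(unsigned long long kb)
{
	return kb > 5ULL * 1024 * 1024; /* 5GiB expressed in kB */
}
```

Feeding each line of a node's meminfo file through parse_node_memfree()
and keeping the samples where above_5gib() is true gives the kind of
per-node series compared above.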


On Tue, Feb 11, 2020 at 2:16 AM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> > This change from 5.5 times:
> >
> > * https://github.com/torvalds/linux/commit/1c30844d2dfe
> >
> > > mm: reclaim small amounts of memory when an external fragmentation event occurs
> >
> > Introduced undesired effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no spikes. After
> > the upgrade we saw some machines decide that they need a lot more than
> > that, with frequent spikes above 10GB, often only on a single numa
> > node.
> >
> > We can see kswapd quite active in balance_pgdat (it didn't look like
> > it slept at all):
> >
> > $ ps uax | fgrep kswapd
> > root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
> > root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
> >
> > This in turn massively increased pressure on page cache, which did not
> > go well for services that depend on having a quick response from a
> > local cache backed by solid storage.
> >
> > Here's what it looked like when I zeroed vm.watermark_boost_factor:
> >
> > * https://imgur.com/a/6IZWicU
> >
> > IO subsided from 100% busy in page cache population at 300MB/s on a
> > single SATA drive down to under 100MB/s.
> >
> > This sort of regression doesn't seem like a good thing.
>
> It is not a good thing, so thanks for the report. Obviously I have not
> seen something similar or at least not severe enough to show up on my radar.
> I'd seen some increases with reclaim activity affecting benchmarks that
> rely on use-twice data remaining resident but nothing severe enough to
> warrant action.
>
> Can you tell me if it is *always* node 0 that shows crazy activity? I
> ask because some conditions would have to be met for the boost to always
> apply. It's already a per-zone attribute but it is treated indirectly as a
> pgdat property. What I'm thinking is that on node 0, the DMA32 or DMA zone
> gets boosted but vmscan then reclaims from higher zones until the boost is
> removed. That would excessively reclaim memory but be specific to node 0.
>
> I've cc'd Rik as he says he saw something similar even on single node
> systems. The boost applying to lower zones would still affect single
> node systems but NUMA machines always getting impacted by boost would
> show that the boost really needs to be a per-node flag. Sure, we *could*
> apply the reclaim to just the lower zones but that potentially means a
> *lot* of scan activity -- potentially 124G of pages before a lower zone
> page is found on Ivan's machine. That might be the very situation being
> encountered here.
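
For scale, a rough sketch of the arithmetic behind the 124G figure. It
assumes 4KiB base pages (the x86-64 default; the page size is not
stated in this thread), and the helper name is made up for
illustration.

```c
#include <assert.h>

/*
 * Number of base pages in `gib` GiB of memory, given the page size as
 * a shift (12 for 4KiB pages). Hypothetical helper, illustration only.
 */
unsigned long long pages_for_gib(unsigned long long gib, unsigned int page_shift)
{
	assert(page_shift <= 30);
	return (gib << 30) >> page_shift;
}
```

Under that assumption, pages_for_gib(124, 12) comes to 32505856, so
roughly 32.5 million pages could be scanned before a lower-zone page
turns up.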
>
> An alternative is that boosting is only ever applied to the highest
> populated zone in a system. The intent of the patch was primarily about
> THP which can use any zone to reduce their allocation latency. While
> it's possible that there are cases where the latency of other orders
> matter *and* they require lower zones, I think it's unlikely and that
> this would be a safer option overall.
>
> However, overall I think the simplest is to abort the boosting if
> reclaim is reaching higher priorities without being able to clear
> the boost. The boost is best-effort to reduce allocation latency in
> the future. This approach still has some overhead as there is a reclaim
> pass but kswapd will abort and go to sleep if the normal watermarks
> are met.
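
To check my reading of the proposal, here is a userspace toy model of
it. This is not kernel code: struct toy_node, toy_balance() and the
fixed reclaim_per_pass are made up, and it assumes every pass that
misses its target raises the scan priority by one step, which is
cruder than what balance_pgdat really does.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12 /* same value as the kernel's DEF_PRIORITY */

/* Toy stand-in for a node; all fields are in pages. */
struct toy_node {
	unsigned long free;             /* currently free pages */
	unsigned long high_wmark;       /* normal high watermark */
	unsigned long boost;            /* outstanding watermark boost */
	unsigned long reclaim_per_pass; /* pages freed by one pass */
};

/* Run the toy reclaim loop; returns the number of reclaim passes. */
unsigned int toy_balance(struct toy_node *n)
{
	unsigned int passes = 0;
	int priority = DEF_PRIORITY;
	bool boosted;

	assert(n);
	boosted = n->boost != 0;

	while (priority >= 1) {
		unsigned long target = n->high_wmark + (boosted ? n->boost : 0);

		if (n->free >= target)
			break; /* balanced against the current target */

		/* Proposed change: give up on the boost and start over. */
		if (boosted && priority == DEF_PRIORITY - 2) {
			n->boost = 0;
			boosted = false;
			priority = DEF_PRIORITY;
			continue;
		}

		n->free += n->reclaim_per_pass;
		passes++;
		priority--; /* simplification: every miss raises urgency */
	}

	if (boosted)
		n->boost = 0; /* boost satisfied, account for it */
	return passes;
}
```

In this model a node that already meets its normal watermark but
carries a boost it cannot satisfy stops after two passes, instead of
reclaiming until the boost is cleared.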
>
> This is build tested only. Ideally someone on the cc has a test case
> that can reproduce this specific problem of excessive kswapd activity.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 572fb17c6273..71dd47172cef 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> 	return false;
> }
>
> +static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
> +					unsigned long *zone_boosts)
> +{
> +	struct zone *zone;
> +	unsigned long flags;
> +	int i;
> +
> +	for (i = 0; i <= classzone_idx; i++) {
> +		if (!zone_boosts[i])
> +			continue;
> +
> +		/* Increments are under the zone lock */
> +		zone = pgdat->node_zones + i;
> +		spin_lock_irqsave(&zone->lock, flags);
> +		zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +}
> +
> /* Clear pgdat state for congested, dirty or under writeback. */
> static void clear_pgdat_congested(pg_data_t *pgdat)
> {
> @@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> 		if (!nr_boost_reclaim && balanced)
> 			goto out;
>
> -		/* Limit the priority of boosting to avoid reclaim writeback */
> -		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> -			raise_priority = false;
> +		/*
> +		 * Abort boosting if reclaiming at higher priority is not
> +		 * working to avoid excessive reclaim due to lower zones
> +		 * being boosted.
> +		 */
> +		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
> +			acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
> +			boosted = false;
> +			nr_boost_reclaim = 0;
> +			goto restart;
> +		}
>
> 		/*
> 		 * Do not writeback or swap pages for boosted reclaim. The
> @@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> out:
> 	/* If reclaim was boosted, account for the reclaim done in this pass */
> 	if (boosted) {
> -		unsigned long flags;
> -
> -		for (i = 0; i <= classzone_idx; i++) {
> -			if (!zone_boosts[i])
> -				continue;
> -
> -			/* Increments are under the zone lock */
> -			zone = pgdat->node_zones + i;
> -			spin_lock_irqsave(&zone->lock, flags);
> -			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> -			spin_unlock_irqrestore(&zone->lock, flags);
> -		}
> +		acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
>
> 		/*
> 		 * As there is now likely space, wakeup kcompact to defragment