Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
From: Vlastimil Babka
Date: Wed May 26 2021 - 14:14:17 EST
On 5/25/21 10:01 AM, Mel Gorman wrote:
> The pcp high watermark is based on the batch size but there is no
> relationship between them other than it is convenient to use early in
> boot.
>
> This patch takes the first step and bases pcp->high on the zone low
> watermark split across the number of CPUs local to a zone while the batch
> size remains the same to avoid increasing allocation latencies. The intent
> behind the default pcp->high is "set the number of PCP pages such that
> if they are all full that background reclaim is not started prematurely".
>
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events which is handled later in the
> series.
>
> On a test KVM instance;
>
> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 378
> batch: 63
>
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 649
> batch: 63
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
...
> @@ -6637,6 +6628,34 @@ static int zone_batchsize(struct zone *zone)
> #endif
> }
>
> +static int zone_highsize(struct zone *zone, int batch)
> +{
> +#ifdef CONFIG_MMU
> + int high;
> + int nr_local_cpus;
> +
> + /*
> + * The high value of the pcp is based on the zone low watermark
> + * so that if they are full then background reclaim will not be
> + * started prematurely. The value is split across all online CPUs
> + * local to the zone. Note that early in boot that CPUs may not be
> + * online yet.
> + */
> + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> + high = low_wmark_pages(zone) / nr_local_cpus;
> +
> + /*
> + * Ensure high is at least batch*4. The multiple is based on the
> + * historical relationship between high and batch.
> + */
> + high = max(high, batch << 2);
> +
> + return high;
> +#else
> + return 0;
> +#endif
> +}
> +
> /*
> * pcp->high and pcp->batch values are related and generally batch is lower
> * than high. They are also related to pcp->count such that count is lower
> @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
> */
> static void zone_set_pageset_high_and_batch(struct zone *zone)
> {
> - unsigned long new_high, new_batch;
> + int new_high, new_batch;
>
> - new_batch = zone_batchsize(zone);
> - new_high = 6 * new_batch;
> - new_batch = max(1UL, 1 * new_batch);
> + new_batch = max(1, zone_batchsize(zone));
> + new_high = zone_highsize(zone, new_batch);
>
> if (zone->pageset_high == new_high &&
> zone->pageset_batch == new_batch)
> @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
> zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
> zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>
> + /*
> + * The watermark size have changed so update the pcpu batch
> + * and high limits or the limits may be inappropriate.
> + */
> + zone_set_pageset_high_and_batch(zone);
Hm so this puts the call in the path of various watermark related sysctl
handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
help against zone_pcp_update() from a hotplug handler. On the other hand, since
hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
calls there are now redundant and could be removed, no?
But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
thus that one will not be protected against the watermark related sysctl
handlers that reach here.
To solve all this, seems like the static lock in setup_per_zone_wmarks() could
become a top-level visible lock and pcp high/batch updates could switch to that
one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
handlers could be removed.
> +
> spin_unlock_irqrestore(&zone->lock, flags);
> }
>
>