Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs

From: David Hildenbrand
Date: Fri May 28 2021 - 05:53:07 EST

On 28.05.21 11:49, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
>> On 28.05.21 11:03, David Hildenbrand wrote:
>>> On 28.05.21 10:55, Mel Gorman wrote:
>>>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>>>> Hi Mel,
>>>>>
>>>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>>>>> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
>>>>> use" mode and being managed via the buddy just like the normal RAM.
>>>>>
>>>>> The PMEM zones are big ones:
>>>>>
>>>>> present 65011712 = 248 G
>>>>> high 134595 = 525 M
>>>>>
>>>>> The PMEM nodes, of course, don't have any CPUs in them.
>>>>>
>>>>> With your series, the pcp->high value per-cpu is 69584 pages or about
>>>>> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
>>>>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>>>>> the zone.
>>>
>>> When I read about having such big amounts of free memory theoretically
>>> stuck in PCP lists, I guess we really want to start draining the PCP in
>>> alloc_contig_range(), just as we do with memory hotunplug when offlining.
>>
>> Correction: we already drain the pcp, we just don't temporarily disable it,
>> so a race as described in offline_pages() could apply:
>>
>> "Disable pcplists so that page isolation cannot race with freeing
>> in a way that pages from isolated pageblock are left on pcplists."
>>
>> Guess we'd then want to move the draining before start_isolate_page_range()
>> in alloc_contig_range().
>
> Or instead of draining, validate the PFN range in alloc_contig_range
> is within the same zone and if so, call zone_pcp_disable() before
> start_isolate_page_range and enable after __alloc_contig_migrate_range.

We require the caller to only pass a range within a single zone, so that should be fine.
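
To make the proposed ordering concrete, here is a rough userspace model
of it. This is only a sketch, not kernel code: struct zone_model, the
model_*() stubs and the trace array are all made up for illustration,
and the stubs do not have the signatures of the real kernel functions.
The later steps of alloc_contig_range() (checking and grabbing the
isolated free pages, undoing the isolation) are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only what the model needs. */
struct zone_model {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	bool pcp_enabled;
};

enum step { STEP_PCP_DISABLE, STEP_ISOLATE, STEP_MIGRATE, STEP_PCP_ENABLE };

#define MAX_TRACE 8
static enum step trace[MAX_TRACE];	/* order in which steps ran */
static size_t trace_len;

static void record(enum step s)
{
	if (trace_len < MAX_TRACE)
		trace[trace_len++] = s;
}

/* Stubs modelling zone_pcp_disable() / zone_pcp_enable(). */
static void model_zone_pcp_disable(struct zone_model *z)
{
	z->pcp_enabled = false;
	record(STEP_PCP_DISABLE);
}

static void model_zone_pcp_enable(struct zone_model *z)
{
	z->pcp_enabled = true;
	record(STEP_PCP_ENABLE);
}

/* Stubs modelling isolation and migration; they always succeed here. */
static int model_isolate_range(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	record(STEP_ISOLATE);
	return 0;
}

static int model_migrate_range(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	record(STEP_MIGRATE);
	return 0;
}

/* The caller contract: [start, end) must lie within a single zone. */
static bool range_in_zone(const struct zone_model *z,
			  unsigned long start, unsigned long end)
{
	return start < end && start >= z->start_pfn && end <= z->end_pfn;
}

/* pcplists stay disabled from before isolation until after migration. */
static int model_alloc_contig_range(struct zone_model *z,
				    unsigned long start, unsigned long end)
{
	int ret;

	trace_len = 0;
	if (!range_in_zone(z, start, end))
		return -1;

	model_zone_pcp_disable(z);
	ret = model_isolate_range(start, end);
	if (!ret)
		ret = model_migrate_range(start, end);
	model_zone_pcp_enable(z);	/* re-enabled on failure as well */

	return ret;
}
```

With pcplists disabled across both steps, pages freed while the range is
being isolated cannot end up on a pcplist, which is the race the
offline_pages() comment describes.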

The only ugly thing about zone_pcp_disable() is mutex_lock(&pcp_batch_high_lock), which would serialize all alloc_contig_range() calls, both with each other and with offline_pages().
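
For scale, the worst-case numbers from Dave's report quoted above can be
re-derived with a few lines of C. This is just a sketch: it assumes
4 KiB base pages, and the helper names are made up for illustration
(they are not kernel functions).

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, as on x86-64. */
#define MODEL_PAGE_SIZE 4096ULL

/* Worst case: every CPU's pcplist for the zone is filled up to pcp->high. */
static uint64_t pcp_worst_case_pages(uint64_t pcp_high, unsigned int nr_cpus)
{
	return pcp_high * nr_cpus;
}

/* Convert a page count to MiB, rounding down. */
static uint64_t pages_to_mib(uint64_t pages)
{
	return (pages * MODEL_PAGE_SIZE) >> 20;
}

/* Share of the zone held on pcplists, in permille, rounding down. */
static unsigned int permille_of_zone(uint64_t pages, uint64_t zone_present)
{
	return (unsigned int)(pages * 1000 / zone_present);
}
```

With pcp->high = 69584 and 96 CPUs this gives 6680064 pages, i.e.
26094 MiB (about 25.5 GiB), which is 102 permille of the 65011712
present pages. That matches the ~26GB and "roughly 10%" figures.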


David / dhildenb