[RFC PATCH 0/7] mm: providing ample physical memory contiguity by confining unmovable allocations
Date: Tue Mar 19 2024 - 22:42:39 EST
From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>
Memory capacity has increased dramatically over the last few decades,
while TLB capacity has stagnated, making virtual address translation a
significant overhead. As a collaboration between Carnegie Mellon
University and Meta, we investigated the issue at Meta’s datacenters and
found that about 20% of CPU cycles are spent doing page walks [1], and
similar results are also reported by Google [2].
To tackle the overhead, we need widespread use of huge pages. And huge
pages, when they can actually be created, work wonders: they provide up
to 18% higher performance for Meta’s production workloads in our
experiments [1].
However, we observed that huge pages through THP are unreliable because
sufficient physical contiguity may not exist and compaction to recover
from memory fragmentation frequently fails. To ensure workloads get a
reasonable number of huge pages, Meta could not rely on THP and had to
use reserved huge pages. Proposals to add 1GB THP support [5] are even
more dependent on ample availability of physical contiguity.
A major reason for the lack of physical contiguity is the mixing of
unmovable and movable allocations, causing compaction to fail. Quoting
from [3], “in a broad sample of Meta servers, we find that unmovable
allocations make up less than 7% of total memory on average, yet occupy
34% of the 2M blocks in the system. We also found that this effect isn't
correlated with high uptimes, and that servers can get heavily
fragmented within the first hour of running a workload.”
Our proposed solution is to confine the unmovable allocations to a
separate region in physical memory. We experimented with using a CMA
region for the movable allocations, but in this version we use
ZONE_MOVABLE for movable and all other zones for unmovable allocations.
Movable allocations can temporarily reside in the unmovable zones, but
will be proactively moved out by compaction.
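As a rough illustration of the placement policy (a userspace sketch
with hypothetical helper names, not the actual allocator code), the
zone preference can be summarized as:

```c
#include <stdbool.h>

/* Zone types relevant to the policy sketched above. */
enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };

/*
 * Hypothetical sketch of the placement policy: unmovable allocations
 * are confined to the unmovable zones, while movable allocations
 * prefer ZONE_MOVABLE but may temporarily spill into the unmovable
 * zones under pressure (compaction later migrates them back out).
 */
enum zone_type preferred_zone(bool movable, bool movable_zone_has_room)
{
	if (!movable)
		return ZONE_NORMAL;	/* unmovable: never placed in ZONE_MOVABLE */
	if (movable_zone_has_room)
		return ZONE_MOVABLE;
	return ZONE_NORMAL;		/* temporary spillover, evacuated by kcompactd */
}
```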
To resize ZONE_MOVABLE, we still rely on memory hotplug interfaces. We
export the number of pages scanned on behalf of movable or unmovable
allocations during reclaim to approximate the memory pressure in two
parts of physical memory, and a userspace tool can monitor the metrics
and make resizing decisions. Previously we augmented the PSI interface
to break down memory pressure into movable and unmovable allocation
types, but that approach enlarges the scheduler cacheline footprint.
From our preliminary observations, looking at the per-allocation-type
scanned counters, with a little tuning, is sufficient to tell whether
there is not enough memory for unmovable allocations and to make
resizing decisions.
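A minimal sketch of the userspace resizing heuristic (the structure
fields, function name, and threshold are assumptions for illustration;
the series exports per-allocation-type scan counts, but the exact
counter names and interface may differ):

```c
/* Snapshot of the exported per-allocation-type reclaim scan counters. */
struct scan_sample {
	long movable_scanned;
	long unmovable_scanned;
};

/*
 * Hypothetical policy: if reclaim scanned more than 'threshold' pages
 * on behalf of unmovable allocations since the last sample, the
 * unmovable zones are under pressure, so the tool should shrink
 * ZONE_MOVABLE (via the memory hotplug / sysfs boundary interface)
 * to give unmovable allocations more room.
 */
int should_shrink_movable(struct scan_sample prev, struct scan_sample cur,
			  long threshold)
{
	return (cur.unmovable_scanned - prev.unmovable_scanned) > threshold;
}
```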
This patch set extends the idea of migratetype isolation at pageblock
granularity posted earlier [3] by Johannes Weiner to an
as-large-as-needed region to better support huge pages of bigger sizes
and hardware TLB coalescing. We’re looking for feedback on the overall
direction, particularly in relation to the recent THP allocator
optimization proposal [4].
The patches are based on 6.4 and are also available on github at
https://github.com/magickaiyang/kernel-contiguous/tree/per_alloc_type_reclaim_counters_oct052023
Kaiyang Zhao (7):
sysfs interface for the boundary of movable zone
Disallow high-order movable allocations in other zones if
ZONE_MOVABLE is populated
compaction accepts a destination zone
vmstat counter for pages migrated across zones
proactively move pages out of unmovable zones in kcompactd
pass gfp mask of the allocation that woke kswapd to track the number
of pages scanned on behalf of each alloc type
export the number of pages scanned on behalf of movable/unmovable
allocations
drivers/base/memory.c | 2 +-
drivers/base/node.c | 32 ++++++
include/linux/compaction.h | 4 +-
include/linux/memory.h | 1 +
include/linux/mmzone.h | 1 +
include/linux/vm_event_item.h | 6 +
mm/compaction.c | 209 ++++++++++++++++++++++++++--------
mm/internal.h | 1 +
mm/page_alloc.c | 10 ++
mm/vmscan.c | 28 ++++-
mm/vmstat.c | 14 ++-
11 files changed, 249 insertions(+), 59 deletions(-)
--
2.40.1