Re: [PATCH] mm: Proactive compaction

From: Vlastimil Babka
Date: Fri Nov 29 2019 - 08:55:15 EST


On 11/15/19 11:21 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) shows that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define
> per page-node tunable called âhpage_compaction_effortâ which dictates
> bounds for external fragmentation for HPAGE_PMD_ORDER pages which
> kcompactd should try to maintain.
>
> The tunable is exposed through sysfs:
> /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
>
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt HPAGE_PMD_ORDER order.

Could we instead start with a non-tunable value that would be linked to
to e.g. the number of THP allocations between kcompactd cycles?
Anything we expose will inevitably get set to stone, I'm afraid, so I
would introduce it only as a last resort.

> Note that previous version of this patch [1] was found to introduce too
> many tunables (per-order, extfrag_{low, high}) but this one reduces them
> to just (per-node, hpage_compaction_effort). Also, the new tunable is an
> opaque value instead of asking for specific bounds of âexternal
> fragmentationâ which would have been difficult to estimate. The internal
> interpretation of this opaque value allows for future fine-tuning.
>
> Currently, we use a simple translation from this tunable to [low, high]
> extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
> periodically check per-node extfrag status, we reuse per-node kcompactd
> threads which are woken up every few milliseconds to check the same. If
> any zone on its corresponding node has extfrag above the high threshold
> for the HPAGE_PMD_ORDER order, the thread starts compaction in
> background till all zones are below the low extfrag level for this
> order. By default. By default, the tunable is set to 0 (=> low=100%,
> high=100%).
>
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
>
> * Performance data
>
> System: x64_64, 32G RAM, 12-cores.
>
> I made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
>
> The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
> and if that fails, tries to allocate with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL`. The drives stops when both methods fail for a
> hugepage allocation.
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then for each 2M aligned section,
> frees 3/4 of base pages using munmap. The workload is mainly anonymous
> userspace pages which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur just by
> hitting the slow path for most allocations.
>
> (all latency values are in microseconds)
>
> - With vanilla kernel 5.4.0-rc5:
>
> percentile latency
> ---------- -------
> 5 7
> 10 7
> 25 8
> 30 8
> 40 8
> 50 8
> 60 9
> 75 215
> 80 222
> 90 323
> 95 429
>
> Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
> total free => 14% of free memory could be allocated as hugepages)
>
> - Now with kernel 5.4.0-rc5 + this patch:
> (hpage_compaction_effort = 60)
>
> percentile latency
> ---------- -------
> 5 3
> 10 3
> 25 4
> 30 4
> 40 4
> 50 4
> 60 5
> 75 6
> 80 9
> 90 370
> 95 652
>
> Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> 25G total free => 86% of free memory could be allocated as hugepages)

I wonder about the 14->86% improvement. As you say, this kind of
fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL attempts succeed?

Thanks,
Vlastimil

> Above workload produces a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, pro-active compaction
> should essentially back off. To test this aspect, I ran a mix of this
> workload (thanks to Matthew Wilcox for suggesting these):
>
> - dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000] where
> first 10000 files actually exist.
> - pagecache_thrash: it opens a 128G file (on a 32G system) and then
> reads at random offsets.
>
> With this mix of workload, system quickly reaches 90-100% fragmentation
> wrt order-9. Trace of compaction events shows that we keep hitting
> compaction_deferred event, as expected.
>
> After terminating dentry_thrash and dropping denty caches, the system
> could proceed with compaction according to set value of
> hpage_compaction_effort (60).
>
> [1] https://patchwork.kernel.org/patch/11098289/
>
> Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>