Re: [RFC] mm: Proactive compaction

From: Nitin Gupta
Date: Thu Aug 22 2019 - 17:57:32 EST


> -----Original Message-----
> From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf
> Of Mel Gorman
> Sent: Thursday, August 22, 2019 1:52 AM
> To: Nitin Gupta <nigupta@xxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; vbabka@xxxxxxx; mhocko@xxxxxxxx;
> dan.j.williams@xxxxxxxxx; Yu Zhao <yuzhao@xxxxxxxxxx>; Matthew Wilcox
> <willy@xxxxxxxxxxxxx>; Qian Cai <cai@xxxxxx>; Andrey Ryabinin
> <aryabinin@xxxxxxxxxxxxx>; Roman Gushchin <guro@xxxxxx>; Greg Kroah-
> Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>; Kees Cook
> <keescook@xxxxxxxxxxxx>; Jann Horn <jannh@xxxxxxxxxx>; Johannes
> Weiner <hannes@xxxxxxxxxxx>; Arun KS <arunks@xxxxxxxxxxxxxx>; Janne
> Huttunen <janne.huttunen@xxxxxxxxx>; Konstantin Khlebnikov
> <khlebnikov@xxxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
> mm@xxxxxxxxx
> Subject: Re: [RFC] mm: Proactive compaction
>
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
>
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time. I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
>

Hitting the slow path for every higher-order allocation is a signification
performance/latency issue for applications that requires a large number of
these allocations to succeed in bursts. To get some concrete numbers, I
made a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
(referred to as "Light" in the table below) and if that fails, tries to
allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
"Fallback" in table below). We stop the allocation loop if both methods
fail.

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microsec.

| GFP/Stat | Any | Light | Fallback |
|--------: | ---------: | ------: | ---------: |
| count | 9908 | 788 | 9120 |
| min | 0.0 | 0.0 | 1726.0 |
| max | 135387.0 | 142.0 | 135387.0 |
| mean | 5494.66 | 1.83 | 5969.26 |
| stddev | 21624.04 | 7.58 | 22476.06 |

As you can see, the mean and stddev of allocation is extremely high with
the current approach of on-demand compaction.

The system was fragmented from a userspace program as I described in this
patch description. The workload is mainly anonymous userspace pages which
as easy to move around. I intentionally avoided unmovable pages in this
test to see how much latency do we incur just by hitting the slow path for
a majority of allocations.


> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
>
> These will be difficult for an admin to tune that is not extremely familiar with
> how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to emperically
determined values on highly specialized systems like database appliances.
However, on a generic system, there is no real reasonable value.


Still, at the very least, I would like an interface that allows compacting
system to a reasonable state. Something like:

compact_extfrag(node, zone, order, high, low)

which start compaction if extfrag > high, and goes on till extfrag < low.

It's possible that there are too many unmovable pages mixed around for
compaction to succeed, still it's a reasonable interface to expose rather
than forced on-demand style of compaction (please see data below).

How (and if) to expose it to userspace (sysfs etc.) can be a separate
discussion.


> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays
> > inactive even if extfrag thresholds are not met.
> >
>
> There is still a risk that if a system is completely fragmented that it may
> consume CPU on pointless compaction cycles. This is why compaction from
> kernel thread context makes no special effort and bails relatively quickly and
> assumes that if an application really needs high-order pages that it'll incur
> the cost at allocation time.
>

As data in Table-1 shows, on-demand compaction can add high latency to
every single allocation. I think it would be a significant improvement (see
Table-2) to at least expose an interface to allow proactive compaction
(like compaction_extfrag), which a driver can itself run in background. This
way, we need not add any tunables to the kernel itself and leave compaction
decision to specialized kernel/userspace monitors.


> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-
> mm/20161230131412.GI13301@xxxxxxxxxxxxxx
> > /
> >
> > Testing done (on x86):
> > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> > - Use a test program to fragment memory: the program allocates all
> > memory and then for each 2M aligned section, frees 3/4 of base pages
> > using munmap.
> > - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts compaction till extfrag < extfrag_low for order-9.
> >
>
> This is a somewhat optimisitic allocation scenario. The interesting ones are
> when a system is fragmenteed in a manner that is not trivial to resolve -- e.g.
> after a prolonged period of time with unmovable/reclaimable allocations
> stealing pageblocks. It's also fairly difficult to analyse if this is helping
> because you cannot measure after the fact how much time was saved in
> allocation time due to the work done by kcompactd. It is also hard to
> determine if the sum of the stalls incurred by proactive compaction is lower
> than the time saved at allocation time.
>
> I fear that the user-visible effect will be times when there are very short but
> numerous stalls due to proactive compaction running in the background that
> will be hard to detect while the benefits may be invisible.
>

Pro-active compaction can be done in a non-time-critical context, so to
estimate its benefits we can just compare data from Table-1 the same run,
under a similar fragmentation state, but with this patch applied:


Table-2: hugepage allocation latencies with this patch applied on
5.3.0-rc5.

| GFP_Stat | Any | Light | Fallback |
| --------:| ----------:| ---------:| ----------:|
| count | 12197.0 | 11167.0 | 1030.0 |
| min | 2.0 | 2.0 | 5.0 |
| max | 361727.0 | 26.0 | 361727.0 |
| mean | 366.05 | 4.48 | 4286.13 |
| stddev | 4575.53 | 1.41 | 15209.63 |


We can see that mean latency dropped to 366us compared with 5494us before.

This is an optimistic scenario where there was a little mix of unmovable
pages but still the data shows that in case compaction can succeed,
pro-active compaction can give signification reduction higher-order
allocation latencies.


> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
>
> As unappealing as it sounds, I think it is better to try improve the allocation
> latency itself instead of trying to hide the cost in a kernel thread. It's far
> harder to implement as compaction is not easy but it would be more
> obvious what the savings are by looking at a histogram of allocation latencies
> -- there are other metrics that could be considered but that's the obvious
> one.
>

Improving allocation latencies in itself would be a separate effort. In
case memory is full or fragmented we have to deal with reclaim or
compaction to make allocation (esp. higher-order) succeed. In particular,
forcing compaction to be done only on-demand is in my opinion not the right
approach. As I detailed above, at the very minimum, we need an interface
like `compact_extfrag` which can leave the decision on specific
kernel/userspace drivers on how pro-active you want compaction to be.