Re: [RFC] mm: Proactive compaction

From: Nitin Gupta
Date: Tue Aug 27 2019 - 16:36:58 EST


On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but
> > > it is not
> > > free either. Even though the scanning and migration may happen in
> > > a kernel
> > > thread, tasks can incur faults while waiting for compaction to
> > > complete if the
> > > task accesses data being migrated. This means that costs are
> > > incurred by
> > > applications on a system that may never care about high-order
> > > allocation
> > > latency -- particularly if the allocations typically happen at
> > > application
> > > initialisation time. I recognise that kcompactd makes a bit of
> > > effort to
> > > compact memory out-of-band but it also is typically triggered in
> > > response to
> > > reclaim that was triggered by a high-order allocation request.
> > > i.e. the work
> > > done by the thread is triggered by an allocation request that hit
> > > the slow
> > > paths and not a preemptive measure.
> > >
> >
> > Hitting the slow path for every higher-order allocation is a
> > signification
> > performance/latency issue for applications that requires a large
> > number of
> > these allocations to succeed in bursts. To get some concrete
> > numbers, I
> > made a small driver that allocates as many hugepages as possible
> > and
> > measures allocation latency:
> >
>
> Every higher-order allocation does not necessarily hit the slow path
> nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense.
I meant to say: higher order allocation *tend* to hit slow path
with a high probability under reasonably fragmented memory state
and when they do, they incur high latency.


>
> > The driver first tries to allocate hugepage using
> > GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails,
> > tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in table below). We stop the allocation loop if both
> > methods
> > fail.
> >
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies
> > are in microsec.
> >
> > > GFP/Stat | Any | Light | Fallback |
> > > --------: | ---------: | ------: | ---------: |
> > > count | 9908 | 788 | 9120 |
> > > min | 0.0 | 0.0 | 1726.0 |
> > > max | 135387.0 | 142.0 | 135387.0 |
> > > mean | 5494.66 | 1.83 | 5969.26 |
> > > stddev | 21624.04 | 7.58 | 22476.06 |
>
> Given that it is expected that there would be significant tail
> latencies,
> it would be better to analyse this in terms of percentiles. A very
> small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
>

Here is the same data in terms of percentiles:

- with vanilla kernel 5.3.0-rc5:

percentile latency
ââââââââââ âââââââ
5 1
10 179
0
25 1829
30 1838
40 1854
50 18
71
60 1890
75 1924
80 1945
90 2
206
95 2302


- Now with kernel 5.3.0-rc5 + this patch:

percentile latency
ââââââââââ âââââââ
5 3
10 4
25
4
30 4
40 4
50 4
60
4
75 5
80 5
90 9
95 1
154


> > As you can see, the mean and stddev of allocation is extremely high
> > with
> > the current approach of on-demand compaction.
> >
> > The system was fragmented from a userspace program as I described
> > in this
> > patch description. The workload is mainly anonymous userspace pages
> > which
> > as easy to move around. I intentionally avoided unmovable pages in
> > this
> > test to see how much latency do we incur just by hitting the slow
> > path for
> > a majority of allocations.
> >
>
> Even though, the penalty for proactive compaction is that
> applications
> that may have no interest in higher-order pages may still stall while
> their data is migrated if the data is hot. This is why I think the
> focus
> should be on reducing the latency of compaction -- it benefits
> applications that require higher-order latencies without increasing
> the
> overhead for unrelated applications.
>

Sure, reducing compaction latency would help but there should still
be an option to proactively compact to hide latencies further.


> > > > For a more proactive compaction, the approach taken here is to
> > > > define
> > > > per page-order external fragmentation thresholds and let
> > > > kcompactd
> > > > threads act on these thresholds.
> > > >
> > > > The low and high thresholds are defined per page-order and
> > > > exposed
> > > > through sysfs:
> > > >
> > > > /sys/kernel/mm/compaction/order-
> > > > [1..MAX_ORDER]/extfrag_{low,high}
> > > >
> > >
> > > These will be difficult for an admin to tune that is not
> > > extremely familiar with
> > > how external fragmentation is defined. If an admin asked "how
> > > much will
> > > stalls be reduced by setting this to a different value?", the
> > > answer will always
> > > be "I don't know, maybe some, maybe not".
> > >
> >
> > Yes, this is my main worry. These values can be set to emperically
> > determined values on highly specialized systems like database
> > appliances.
> > However, on a generic system, there is no real reasonable value.
> >
>
> Yep, which means the tunable will be vulnerable to cargo-cult tuning
> recommendations. Or worse, the tuning recommendation will be a flat
> "disable THP".
>

I thought more on this and yes, exposing a system wide per-order
extfrag threshold may not be the best approach. Instead, expose a
specific interface to compact a zone to a specified level and leave the
policy on when to trigger (based on extfrag levels, system load etc.)
upto the user (kernel driver or userspace daemon).


> > Still, at the very least, I would like an interface that allows
> > compacting
> > system to a reasonable state. Something like:
> >
> > compact_extfrag(node, zone, order, high, low)
> >
> > which start compaction if extfrag > high, and goes on till extfrag
> > < low.
> >
> > It's possible that there are too many unmovable pages mixed around
> > for
> > compaction to succeed, still it's a reasonable interface to expose
> > rather
> > than forced on-demand style of compaction (please see data below).
> >
> > How (and if) to expose it to userspace (sysfs etc.) can be a
> > separate
> > discussion.
> >
>
> That would be functionally similar to vm.compact_memory although it
> would either need an extension or a separate tunable. With sysfs,
> there
> could be a per-node file that takes with a watermark and order tuple
> to
> trigger the interface.
>

Something like:
/sys/kernel/mm/node-n/compact
or, /sys/kernel/mm/compact-n
where n in [0, NUM_NODES],

which takes tuple watermark and order, should do?

I'm also okay not adding any of these sysfs interface for now.

> > > > Per-node kcompactd thread is woken up every few seconds to
> > > > check if
> > > > any zone on its node has extfrag above the extfrag_high
> > > > threshold for
> > > > any order, in which case the thread starts compaction in the
> > > > backgrond
> > > > till all zones are below extfrag_low level for all orders. By
> > > > default
> > > > both these thresolds are set to 100 for all orders which
> > > > essentially
> > > > disables kcompactd.
> > > >
> > > > To avoid wasting CPU cycles when compaction cannot help, such
> > > > as when
> > > > memory is full, we check both, extfrag > extfrag_high and
> > > > compaction_suitable(zone). This allows kcomapctd thread to
> > > > stays
> > > > inactive even if extfrag thresholds are not met.
> > > >
> > >
> > > There is still a risk that if a system is completely fragmented
> > > that it may
> > > consume CPU on pointless compaction cycles. This is why
> > > compaction from
> > > kernel thread context makes no special effort and bails
> > > relatively quickly and
> > > assumes that if an application really needs high-order pages that
> > > it'll incur
> > > the cost at allocation time.
> > >
> >
> > As data in Table-1 shows, on-demand compaction can add high latency
> > to
> > every single allocation. I think it would be a significant
> > improvement (see
> > Table-2) to at least expose an interface to allow proactive
> > compaction
> > (like compaction_extfrag), which a driver can itself run in
> > background. This
> > way, we need not add any tunables to the kernel itself and leave
> > compaction
> > decision to specialized kernel/userspace monitors.
> >
>
> I do not have any major objection -- again, it's not that dissimilar
> to
> compact_memory (although that was intended as a debugging interface).
>

Yes, the only difference is I want to stop compaction compaction till
we hit the given extfrag level.


> > > > This patch is largely based on ideas from Michal Hocko posted
> > > > here:
> > > > https://lore.kernel.org/linux-
> > > mm/20161230131412.GI13301@xxxxxxxxxxxxxx
> > > > /
> > > >
> > > > Testing done (on x86):
> > > > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} =
> > > > {25, 30}
> > > > respectively.
> > > > - Use a test program to fragment memory: the program allocates
> > > > all
> > > > memory and then for each 2M aligned section, frees 3/4 of base
> > > > pages
> > > > using munmap.
> > > > - kcompactd0 detects fragmentation for order-9 > extfrag_high
> > > > and
> > > > starts compaction till extfrag < extfrag_low for order-9.
> > > >
> > >
> > > This is a somewhat optimisitic allocation scenario. The
> > > interesting ones are
> > > when a system is fragmenteed in a manner that is not trivial to
> > > resolve -- e.g.
> > > after a prolonged period of time with unmovable/reclaimable
> > > allocations
> > > stealing pageblocks. It's also fairly difficult to analyse if
> > > this is helping
> > > because you cannot measure after the fact how much time was saved
> > > in
> > > allocation time due to the work done by kcompactd. It is also
> > > hard to
> > > determine if the sum of the stalls incurred by proactive
> > > compaction is lower
> > > than the time saved at allocation time.
> > >
> > > I fear that the user-visible effect will be times when there are
> > > very short but
> > > numerous stalls due to proactive compaction running in the
> > > background that
> > > will be hard to detect while the benefits may be invisible.
> > >
> >
> > Pro-active compaction can be done in a non-time-critical context,
> > so to
> > estimate its benefits we can just compare data from Table-1 the
> > same run,
> > under a similar fragmentation state, but with this patch applied:
> >
>
> How do you define what a non-time-critical context is? Once
> compaction
> starts, an applications data becomes temporarily unavailable during
> migration.


By time-critical-context I roughly mean contexts where hugepage
allocations are triggered in response to a user action and any delay
here would be directly noticable by the user. Compare this scenario
with a backround thread doing compaction: this activity can appear
as random freezes for running applications. Whether this
effect on unrelated applications is acceptable or not can be left
to user of this new compaction interface.

>
> > Table-2: hugepage allocation latencies with this patch applied on
> > 5.3.0-rc5.
> >
> > > GFP_Stat | Any | Light | Fallback |
> > > --------:| ----------:| ---------:| ----------:|
> > > count | 12197.0 | 11167.0 | 1030.0 |
> > > min | 2.0 | 2.0 | 5.0 |
> > > max | 361727.0 | 26.0 | 361727.0 |
> > > mean | 366.05 | 4.48 | 4286.13 |
> > > stddev | 4575.53 | 1.41 | 15209.63 |
> >
> > We can see that mean latency dropped to 366us compared with 5494us
> > before.
> >
> > This is an optimistic scenario where there was a little mix of
> > unmovable
> > pages but still the data shows that in case compaction can succeed,
> > pro-active compaction can give signification reduction higher-order
> > allocation latencies.
> >
>
> Which still does not address the point that reducing compaction
> overhead
> is generally beneficial without incurring additional overhead to
> unrelated applications.
>

Yes, reducing compaction latency is always beneficial especially if
it can be done in a way not to touch (hot) pages from unrelated
applications.
Even with good improvements in this area, proactive compaction would
still be good to have.


> I'm not against the use of an interface because it requires an
> application
> to make a deliberate choice and understand the downsides which can be
> documented. An automatic proactive compaction may impact users that
> have
> no idea the feature even exists.
>

I'm now dropping the idea of exposing per-order extfrag thresholds and
would now focus on an interface to compact memory to reach a given
extfrag level instead.

Thanks,
Nitin