RE: [PATCH] mm: Add callback for defining compaction completion
From: Nitin Gupta
Date: Tue Sep 10 2019 - 18:28:05 EST
> -----Original Message-----
> From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf
> Of Michal Hocko
> Sent: Tuesday, September 10, 2019 1:19 PM
> To: Nitin Gupta <nigupta@xxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; vbabka@xxxxxxx;
> mgorman@xxxxxxxxxxxxxxxxxxx; dan.j.williams@xxxxxxxxx;
> khalid.aziz@xxxxxxxxxx; Matthew Wilcox <willy@xxxxxxxxxxxxx>; Yu Zhao
> <yuzhao@xxxxxxxxxx>; Qian Cai <cai@xxxxxx>; Andrey Ryabinin
> <aryabinin@xxxxxxxxxxxxx>; Allison Randal <allison@xxxxxxxxxxx>; Mike
> Rapoport <rppt@xxxxxxxxxxxxxxxxxx>; Thomas Gleixner
> <tglx@xxxxxxxxxxxxx>; Arun KS <arunks@xxxxxxxxxxxxxx>; Wei Yang
> <richard.weiyang@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
> mm@xxxxxxxxx
> Subject: Re: [PATCH] mm: Add callback for defining compaction completion
>
> On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > In general, compaction can introduce unexpected latencies for
> > applications that don't even have strong requirements for contiguous
> > allocations. It is also hard to efficiently determine if the current
> > system state can be easily compacted due to mixing of unmovable
> > memory. Due to these reasons, automatic background compaction by the
> > kernel itself is hard to get right in a way which does not hurt
> > unsuspecting applications or waste CPU cycles.
>
> We do trigger background compaction on a high order pressure from the
> page allocator by waking up kcompactd. Why is that not sufficient?
>
Whenever kcompactd is woken up, it does just enough work to create a
single free page of the given order (compact_control.order) or higher.
Such a design causes very high allocation latency for workloads that
want to allocate many hugepages in a short period of time. With
proactive compaction we can hide much of this latency. For more
background discussion and data, please see this thread:
https://patchwork.kernel.org/patch/11098289/
> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
>
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overall MM situation.
>
Embedding any specific policy (like: keep external fragmentation for
order-9 between 30-40%) within the MM core looks like a bad idea. A
driver, on the other hand, can easily measure parameters like system
load and the current fragmentation level for any order in any zone to
make an informed decision. See the thread I referred to above for more
background discussion.
> If you need to control compaction from the userspace you have an interface
> for that. It is also completely unexplained why you need a completion
> callback.
>
/proc/sys/vm/compact_memory triggers whole-system compaction, which is
often too much for a proactive compaction strategy. To get finer control
over how much compaction work is done, I have added a compaction
callback which bounds the work done in one compaction cycle.
For example, as a test for this patch, I have a small test driver which
defines [low, high] external fragmentation thresholds for HPAGE_ORDER.
Whenever extfrag is within this range, I run compact_zone_order() with a
callback which returns COMPACT_CONTINUE until extfrag drops to the low
threshold, at which point it returns COMPACT_PARTIAL_SKIPPED.
Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847
Maybe this code can be added to Documentation/...
Thanks,
Nitin
>
> > Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>
> > ---
> > include/linux/compaction.h | 10 ++++++++++
> > mm/compaction.c | 20 ++++++++++++++------
> > mm/internal.h | 2 ++
> > 3 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..1ea828450fa2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -58,6 +58,16 @@ enum compact_result {
> > COMPACT_SUCCESS,
> > };
> >
> > +/* Callback function to determine if compaction is finished. */
> > +typedef enum compact_result (*compact_finished_cb)(
> > + struct zone *zone, int order);
> > +
> > +enum compact_result compact_zone_order(struct zone *zone, int order,
> > + gfp_t gfp_mask, enum compact_priority prio,
> > + unsigned int alloc_flags, int classzone_idx,
> > + struct page **capture,
> > + compact_finished_cb compact_finished_cb);
> > +
> > struct alloc_context; /* in mm/internal.h */
> >
> > /*
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 952dc2fb24e5..73e2e9246bc4 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> > return COMPACT_PARTIAL_SKIPPED;
> > }
> >
> > + if (cc->compact_finished_cb)
> > + return cc->compact_finished_cb(cc->zone, cc->order);
> > +
> > if (is_via_compact_memory(cc->order))
> > return COMPACT_CONTINUE;
> >
> > @@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> > return ret;
> > }
> >
> > -static enum compact_result compact_zone_order(struct zone *zone, int order,
> > +enum compact_result compact_zone_order(struct zone *zone, int order,
> > gfp_t gfp_mask, enum compact_priority prio,
> > unsigned int alloc_flags, int classzone_idx,
> > - struct page **capture)
> > + struct page **capture,
> > + compact_finished_cb compact_finished_cb)
> > {
> > enum compact_result ret;
> > struct compact_control cc = {
> > @@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> > MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
> > .alloc_flags = alloc_flags,
> > .classzone_idx = classzone_idx,
> > - .direct_compaction = true,
> > + .direct_compaction = !compact_finished_cb,
> > .whole_zone = (prio == MIN_COMPACT_PRIORITY),
> > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> > - .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > + .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
> > + .compact_finished_cb = compact_finished_cb
> > };
> > struct capture_control capc = {
> > .cc = &cc,
> > @@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> > VM_BUG_ON(!list_empty(&cc.freepages));
> > VM_BUG_ON(!list_empty(&cc.migratepages));
> >
> > - *capture = capc.page;
> > + if (capture)
> > + *capture = capc.page;
> > current->capture_control = NULL;
> >
> > return ret;
> > }
> > +EXPORT_SYMBOL(compact_zone_order);
> >
> > int sysctl_extfrag_threshold = 500;
> >
> > @@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> > }
> >
> > status = compact_zone_order(zone, order, gfp_mask, prio,
> > - alloc_flags, ac_classzone_idx(ac), capture);
> > + alloc_flags, ac_classzone_idx(ac), capture,
> > + NULL);
> > rc = max(status, rc);
> >
> > /* The allocation should succeed, stop compacting */
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index e32390802fd3..f873f7c2b9dc 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -11,6 +11,7 @@
> > #include <linux/mm.h>
> > #include <linux/pagemap.h>
> > #include <linux/tracepoint-defs.h>
> > +#include <linux/compaction.h>
> >
> > /*
> > * The set of flags that only affect watermark checking and reclaim
> > @@ -203,6 +204,7 @@ struct compact_control {
> > bool whole_zone; /* Whole zone should/has been scanned */
> > bool contended; /* Signal lock or sched contention */
> > bool rescan; /* Rescanning the same pageblock */
> > + compact_finished_cb compact_finished_cb;
> > };
> >
> > /*
> > --
> > 2.21.0
>
> --
> Michal Hocko
> SUSE Labs