Re: [PATCH] mm: Add callback for defining compaction completion

From: Michal Hocko
Date: Tue Sep 10 2019 - 16:25:37 EST


On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as hugepages.
> However, on a running system, higher order allocations can fail if the
> memory is fragmented. Linux kernel currently does on-demand compaction as
> we request more hugepages but this style of compaction incurs very high
> latency. Experiments with one-time full memory compaction (followed by
> hugepage allocations) shows that kernel is able to restore a highly
> fragmented memory state to a fairly compacted memory state within <1 sec
> for a 32G system. Such data suggests that a more proactive compaction can
> help us allocate a large fraction of memory as hugepages keeping allocation
> latencies low.
>
> In general, compaction can introduce unexpected latencies for applications
> that don't even have strong requirements for contiguous allocations. It is
> also hard to efficiently determine if the current system state can be
> easily compacted due to mixing of unmovable memory. Due to these reasons,
> automatic background compaction by the kernel itself is hard to get right
> in a way which does not hurt unsuspecting applications or waste CPU cycles.

We do trigger background compaction on a high order pressure from the
page allocator by waking up kcompactd. Why is that not sufficient?

> Even with these caveats, pro-active compaction can still be very useful in
> certain scenarios to reduce hugepage allocation latencies. This callback
> interface allows drivers to drive compaction based on their own policies
> like the current level of external fragmentation for a particular order,
> system load etc.

So we do not trust the core MM to make a reasonable decision while we
give a free ticket to modules. How does this make any sense at all? How
is a random module going to make a more informed decision when it has
less visibility on the overal MM situation.

If you need to control compaction from the userspace you have an
interface for that. It is also completely unexplained why you need a
completion callback.

That being said, this looks like a terrible idea to me.

> Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>
> ---
> include/linux/compaction.h | 10 ++++++++++
> mm/compaction.c | 20 ++++++++++++++------
> mm/internal.h | 2 ++
> 3 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..1ea828450fa2 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -58,6 +58,16 @@ enum compact_result {
> COMPACT_SUCCESS,
> };
>
> +/* Callback function to determine if compaction is finished. */
> +typedef enum compact_result (*compact_finished_cb)(
> + struct zone *zone, int order);
> +
> +enum compact_result compact_zone_order(struct zone *zone, int order,
> + gfp_t gfp_mask, enum compact_priority prio,
> + unsigned int alloc_flags, int classzone_idx,
> + struct page **capture,
> + compact_finished_cb compact_finished_cb);
> +
> struct alloc_context; /* in mm/internal.h */
>
> /*
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 952dc2fb24e5..73e2e9246bc4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> return COMPACT_PARTIAL_SKIPPED;
> }
>
> + if (cc->compact_finished_cb)
> + return cc->compact_finished_cb(cc->zone, cc->order);
> +
> if (is_via_compact_memory(cc->order))
> return COMPACT_CONTINUE;
>
> @@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> return ret;
> }
>
> -static enum compact_result compact_zone_order(struct zone *zone, int order,
> +enum compact_result compact_zone_order(struct zone *zone, int order,
> gfp_t gfp_mask, enum compact_priority prio,
> unsigned int alloc_flags, int classzone_idx,
> - struct page **capture)
> + struct page **capture,
> + compact_finished_cb compact_finished_cb)
> {
> enum compact_result ret;
> struct compact_control cc = {
> @@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
> .alloc_flags = alloc_flags,
> .classzone_idx = classzone_idx,
> - .direct_compaction = true,
> + .direct_compaction = !compact_finished_cb,
> .whole_zone = (prio == MIN_COMPACT_PRIORITY),
> .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> - .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> + .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
> + .compact_finished_cb = compact_finished_cb
> };
> struct capture_control capc = {
> .cc = &cc,
> @@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> VM_BUG_ON(!list_empty(&cc.freepages));
> VM_BUG_ON(!list_empty(&cc.migratepages));
>
> - *capture = capc.page;
> + if (capture)
> + *capture = capc.page;
> current->capture_control = NULL;
>
> return ret;
> }
> +EXPORT_SYMBOL(compact_zone_order);
>
> int sysctl_extfrag_threshold = 500;
>
> @@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> }
>
> status = compact_zone_order(zone, order, gfp_mask, prio,
> - alloc_flags, ac_classzone_idx(ac), capture);
> + alloc_flags, ac_classzone_idx(ac), capture,
> + NULL);
> rc = max(status, rc);
>
> /* The allocation should succeed, stop compacting */
> diff --git a/mm/internal.h b/mm/internal.h
> index e32390802fd3..f873f7c2b9dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,7 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/tracepoint-defs.h>
> +#include <linux/compaction.h>
>
> /*
> * The set of flags that only affect watermark checking and reclaim
> @@ -203,6 +204,7 @@ struct compact_control {
> bool whole_zone; /* Whole zone should/has been scanned */
> bool contended; /* Signal lock or sched contention */
> bool rescan; /* Rescanning the same pageblock */
> + compact_finished_cb compact_finished_cb;
> };
>
> /*
> --
> 2.21.0

--
Michal Hocko
SUSE Labs