Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled

From: Mel Gorman
Date: Wed Jan 25 2012 - 10:00:18 EST


On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
> When built with CONFIG_COMPACTION, kswapd does not try to free
> contiguous pages.

balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
to THP because kswapd does not get woken for THP, but it is woken for
other high-order allocations such as jumbo frames or order-1 requests.

This order is passed to shrink_zone(), shrink_list(),
shrink_inactive_list() and finally to isolate_lru_pages(), which
isolates pages within a naturally aligned boundary. This happens even
with CONFIG_COMPACTION enabled. There have been a number of changes in
vmscan.c recently, so maybe we are looking at different versions, but
why do you say it does not free contiguous pages?
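
To illustrate the natural alignment, here is a minimal userspace sketch
of the pfn masking that isolate_lru_pages() does (the pfn and order
values are made up):

#include <stdio.h>

int main(void)
{
        unsigned long pfn = 1000005;    /* tag page's pfn, made-up value */
        unsigned int order = 2;         /* order kswapd was woken for */

        /* the naturally aligned block surrounding the tag page */
        unsigned long start = pfn & ~((1UL << order) - 1);
        unsigned long end = start + (1UL << order);

        /* prints: isolate pfns [1000004, 1000008) around 1000005 */
        printf("isolate pfns [%lu, %lu) around %lu\n", start, end, pfn);
        return 0;
}

Everything in [start, end) is fair game for isolation once the tag page
has been taken.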

That said, this is lumpy-reclaim-like behaviour and we want to move
away from it. There is no guarantee that the pages in the naturally
aligned region can be isolated, which is particularly true if swap is
disabled, as a single unreclaimable anonymous page then pins the whole
block. Moving away from it has not happened yet because each cycle had
changes to compaction that took priority (the most recent being
sync-light). I'm glad you picked this up, so thanks.

As kswapd does no memory compaction itself, this patch still makes
sense but I found the changelog misleading.

> Because it is not trying, it should also not
> test whether it succeeded, because that can result in continuous
> page reclaim, until a large fraction of memory is free and large
> fractions of the working set have been evicted.
>
> In shrink_inactive_list, we should not try to do higher order
> (out of LRU order) page isolation, unless we really are in
> lumpy reclaim mode. This gives all pages a good amount of time
> on the inactive list, giving the actively used pages the chance
> to get referenced and avoid eviction.
>
> Also remove a line of code that increments balanced right before
> exiting the function.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> ---
> mm/vmscan.c | 29 ++++++++++++++++++++++-------
> 1 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2880396..0398fab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1512,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
> unsigned long nr_writeback = 0;
> isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
> struct zone *zone = mz->zone;
> + int order = 0;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1522,8 +1523,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
> }
>
> set_reclaim_mode(priority, sc, false);
> - if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
> + if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) {
> reclaim_mode |= ISOLATE_ACTIVE;
> + order = sc->order;
> + }
>
> lru_add_drain();
>

This is a nit-pick, but I would far prefer it if you did not bypass
sc->order like this and instead changed isolate_lru_pages() to do:

if (!order || !(sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM))
continue;

That would very clearly mark where LUMPYRECLAIM takes effect in
isolate_lru_pages() and would make deleting LUMPYRECLAIM easier in the
future.

The second, non-obvious effect of this change is that kswapd will now
isolate fewer pages per cycle: it will isolate SWAP_CLUSTER_MAX pages
instead of the up to SWAP_CLUSTER_MAX << order it potentially does
currently. This is not wrong as such, and may be desirable to limit how
much reclaim kswapd does, but it potentially hurts success rates for
compaction. As this does not apply to THP, it will be difficult to
detect, but bear it in mind if we see an increase in high-order
allocation failures after this patch is merged. I am not suggesting a
change here, but it would be nice to note this in the changelog if
there is a new version of this patch.
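
To put rough numbers on it: with SWAP_CLUSTER_MAX == 32, an order-2
wakeup (e.g. for jumbo frames) can currently isolate up to
32 << 2 == 128 pages per shrink_inactive_list() call; after this patch
it is capped at 32.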

> @@ -1535,7 +1538,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
> spin_lock_irq(&zone->lru_lock);
>
> nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list,
> - &nr_scanned, sc->order,
> + &nr_scanned, order,
> reclaim_mode, 0, file);
> if (global_reclaim(sc)) {
> zone->pages_scanned += nr_scanned;
> @@ -2754,7 +2757,7 @@ loop_again:
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int nr_slab;
> + int nr_slab, testorder;
> unsigned long balance_gap;
>
> if (!populated_zone(zone))
> @@ -2783,11 +2786,25 @@ loop_again:
> * gap is either the low watermark or 1%
> * of the zone, whichever is smaller.
> */
> + testorder = order;
> balance_gap = min(low_wmark_pages(zone),
> (zone->present_pages +
> KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> KSWAPD_ZONE_BALANCE_GAP_RATIO);
> - if (!zone_watermark_ok_safe(zone, order,
> + /*
> + * Kswapd reclaims only single pages when
> + * COMPACTION_BUILD. Trying too hard to get
> + * contiguous free pages can result in excessive
> + * amounts of free memory, and useful things
> + * getting kicked out of memory.
> + * Limit the amount of reclaim to something sane,
> + * plus space for compaction to do its thing.
> + */
> + if (COMPACTION_BUILD) {
> + testorder = 0;
> + balance_gap += 2<<order;
> + }

This 2<<order logic now appears in a few places. I am not expecting it
to be handled in this patch, but at some point the various balance_gap
calculations need to be gathered in one place.
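
Off the top of my head, compaction_suitable() and
should_continue_reclaim() carry near-identical calculations.
Paraphrasing from memory, so double-check against your tree:

        /* mm/compaction.c, compaction_suitable() */
        watermark = low_wmark_pages(zone) + (2UL << order);
        if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
                return COMPACT_SKIPPED;

        /* mm/vmscan.c, should_continue_reclaim() */
        pages_for_compaction = (2UL << sc->order);

A shared helper (call it compaction_gap(order), name invented here)
would give them one definition to keep in sync.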

> + if (!zone_watermark_ok_safe(zone, testorder,
> high_wmark_pages(zone) + balance_gap,
> end_zone, 0)) {
> shrink_zone(priority, zone, &sc);
> @@ -2816,7 +2833,7 @@ loop_again:
> continue;
> }
>
> - if (!zone_watermark_ok_safe(zone, order,
> + if (!zone_watermark_ok_safe(zone, testorder,
> high_wmark_pages(zone), end_zone, 0)) {
> all_zones_ok = 0;
> /*
> @@ -2922,8 +2939,6 @@ out:
>
> /* If balanced, clear the congested flag */
> zone_clear_flag(zone, ZONE_CONGESTED);
> - if (i <= *classzone_idx)
> - balanced += zone->present_pages;
> }

Why is this being deleted? It is still used by pgdat_balanced().

> }
>

--
Mel Gorman
SUSE Labs