Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation failures

From: Dave Young
Date: Tue May 03 2011 - 21:56:55 EST


On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> Concurrent page allocations are suffering from high failure rates.
>
> On an 8p, 3GB RAM test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> nr_alloc_fail 733      # interleaved reads by 1 single task
> nr_alloc_fail 11799    # concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
>         for i in `seq 1000`
>         do
>                 truncate -s 1G /fs/sparse-$i
>                 dd if=/fs/sparse-$i of=/dev/null &
>         done
>

With a Core2 Duo, 3GB RAM, and no swap partition, I cannot reproduce the allocation failures.

> In order for get_page_from_freelist() to get a free page,
>
> (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>     possible low watermark state as well as fill the pcp with enough free
>     pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
>     watermark than its normal invocations, so that it can reasonably
>     "reserve" some free pages for itself and prevent other concurrent
>     page allocators from stealing all its reclaimed pages.
>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>   reclaim allocation fails") has the same target, however is obviously
>   costly and less effective. It seems cleaner to just remove the
>   retry and drain code than to retain it.
>
> - it's a bit hacky to reclaim more than the requested pages inside
>   do_try_to_free_pages(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
>   pages, so it stops the opportunistic reclaim once it has scanned more
>   than 2 times the pages to reclaim
>
> Test results:
>
> - the failure rate is pretty sensitive to the page reclaim size,
>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
>
> LOC:     536274     532529     531734     536801     536510     533676     534853     532038   Local timer interrupts
> RES:       3032       2128       1792       1765       2184       1703       1754       1865   Rescheduling interrupts
> TLB:        189         15         13         17         64        294         97         63   TLB shootdowns

Could you tell me how to get the above info?

>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL:         93        286        396        754        272        297        275        281   Function call interrupts
>
> LOC:     520550     517751     517043     522016     520302     518479     519329     517179   Local timer interrupts
> RES:       2131       1371       1376       1269       1390       1181       1409       1280   Rescheduling interrupts
> TLB:        280         26         27         30         65        305        134         75   TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL:         93        463        410        540        298        282        272        306   Function call interrupts
>
> LOC:     513956     510749     509890     514897     514300     512392     512825     510574   Local timer interrupts
> RES:       1174       2081       1411       1320       1742       2683       1380       1230   Rescheduling interrupts
> TLB:        274         21         19         22         57        317        131         61   TLB shootdowns
>
> this patch (WMARK_HIGH, limited scan)
> -------------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL:         69        443        421        567        273        279        269        334   Function call interrupts
>
> LOC:     514736     511698     510993     514069     514185     512986     513838     511229   Local timer interrupts
> RES:       2153       1556       1126       1351       3047       1554       1131       1560   Rescheduling interrupts
> TLB:        209         26         20         15         71        315        117         71   TLB shootdowns
>
> CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  mm/page_alloc.c |   17 +++--------------
>  mm/vmscan.c     |    6 ++++++
>  2 files changed, 9 insertions(+), 14 deletions(-)
> --- linux-next.orig/mm/vmscan.c 2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/vmscan.c      2011-04-28 21:28:57.000000000 +0800
> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>                                 continue;
>                         if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>                                 continue;       /* Let kswapd poll it */
> +                       sc->nr_to_reclaim = max(sc->nr_to_reclaim,
> +                                               zone->watermark[WMARK_HIGH]);
>                 }
>
>                 shrink_zone(priority, zone, sc);
> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>         struct zoneref *z;
>         struct zone *zone;
>         unsigned long writeback_threshold;
> +       unsigned long min_reclaim = sc->nr_to_reclaim;
>
>         get_mems_allowed();
>         delayacct_freepages_start();
> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>                         }
>                 }
>                 total_scanned += sc->nr_scanned;
> +               if (sc->nr_reclaimed >= min_reclaim &&
> +                   total_scanned > 2 * sc->nr_to_reclaim)
> +                       goto out;
>                 if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>                         goto out;
>
> --- linux-next.orig/mm/page_alloc.c     2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/page_alloc.c  2011-04-28 21:16:18.000000000 +0800
> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>         nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>         int migratetype, unsigned long *did_some_progress)
>  {
> -       struct page *page = NULL;
> +       struct page *page;
>         struct reclaim_state reclaim_state;
> -       bool drained = false;
>
>         cond_resched();
>
> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>         if (unlikely(!(*did_some_progress)))
>                 return NULL;
>
> -retry:
> +       alloc_flags |= ALLOC_HARDER;
> +
>         page = get_page_from_freelist(gfp_mask, nodemask, order,
>                                         zonelist, high_zoneidx,
>                                         alloc_flags, preferred_zone,
>                                         migratetype);
> -
> -       /*
> -        * If an allocation failed after direct reclaim, it could be because
> -        * pages are pinned on the per-cpu lists. Drain them and try again
> -        */
> -       if (!page && !drained) {
> -               drain_all_pages();
> -               drained = true;
> -               goto retry;
> -       }
> -
>         return page;
>  }
>
>



--
Regards
dave