Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation failures

From: Minchan Kim
Date: Sun May 01 2011 - 12:36:15 EST


Hi Wu,

On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > Test results:
> > >
> > > - the failure rate is pretty sensible to the page reclaim size,
> > > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > >
> > > - the IPIs are reduced by over 100 times
> >
> > It's reduced by 500 times indeed.
> >
> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> >
> > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > -------------------------------------------------------------------------------
> > > nr_alloc_fail 10496
> > > allocstall 1576602
> >
> > > patched (WMARK_MIN)
> > > -------------------
> > > nr_alloc_fail 704
> > > allocstall 105551
> >
> > > patched (WMARK_HIGH)
> > > --------------------
> > > nr_alloc_fail 282
> > > allocstall 53860
> >
> > > this patch (WMARK_HIGH, limited scan)
> > > -------------------------------------
> > > nr_alloc_fail 276
> > > allocstall 54034
> >
> > There is a bad side effect though: the much reduced "allocstall" means
> > each direct reclaim will take much more time to complete. A simple solution
> > is to terminate direct reclaim after 10ms. I noticed that an 100ms
> > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > Further lowering the time threshold to 20ms does not help reducing the
> > real latencies though.
>
> Experiments going on...
>
> I tried the more reasonable terminate condition: stop direct reclaim
> when the preferred zone is above high watermark (see the below chunk).
>
> This helps reduce the average reclaim latency to under 100ms in the
> 1000-dd case.
>
> However nr_alloc_fail is around 5000 and not ideal. The interesting
> thing is, even if zone watermark is high, the task still may fail to
> get a free page..
>
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> }
> }
> total_scanned += sc->nr_scanned;
> - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> - goto out;
> + if (sc->nr_reclaimed >= min_reclaim) {
> + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> + goto out;
> + if (total_scanned > 2 * sc->nr_to_reclaim)
> + goto out;
> + if (preferred_zone &&
> + zone_watermark_ok_safe(preferred_zone, sc->order,
> + high_wmark_pages(preferred_zone),
> + zone_idx(preferred_zone), 0))
> + goto out;
> + }
>
> /*
> * Try to write back as many pages as we just scanned. This
>
> Thanks,
> Fengguang
> ---
> Subject: mm: cut down __GFP_NORETRY page allocation failures
> Date: Thu Apr 28 13:46:39 CST 2011
>
> Concurrent page allocations are suffering from high failure rates.
>
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> nr_alloc_fail 733 # interleaved reads by 1 single task
> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
> for i in `seq 1000`
> do
> truncate -s 1G /fs/sparse-$i
> dd if=/fs/sparse-$i of=/dev/null &
> done
>
> In order for get_page_from_freelist() to get free page,
>
> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
> current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
> possible low watermark state as well as fill the pcp with enough free
> pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
> watermark than its normal invocations, so that it can reasonably
> "reserve" some free pages for itself and prevent other concurrent
> page allocators stealing all its reclaimed pages.

Did you see my old patch? The patch wasn't complete, but it's good enough to show the idea.
http://marc.info/?l=linux-mm&m=129187231129887&w=4
The idea is to keep at least one page for the process that did the direct reclaim.
Could it mitigate your problem, or could you enhance the idea?
I think it's a very simple and fair solution.
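
To make the idea concrete, here is a toy user-space model. It is not the
patch itself: struct free_pool, release_reclaimed() and steal_pages() are
hypothetical names, and the free list is reduced to a counter. It only
shows that a page held back by the reclaimer cannot be taken by
concurrent allocators.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free list: only a counter, shared by all allocating tasks. */
struct free_pool {
	unsigned long nr_free;
};

/*
 * Hypothetical helper: a task that just reclaimed nr_reclaimed pages
 * holds back up to nr_keep of them for its own allocation and returns
 * the rest to the shared pool. Returns how many pages the task kept.
 */
static unsigned long release_reclaimed(struct free_pool *pool,
				       unsigned long nr_reclaimed,
				       unsigned long nr_keep)
{
	unsigned long kept;

	assert(pool != NULL);
	kept = nr_reclaimed < nr_keep ? nr_reclaimed : nr_keep;
	pool->nr_free += nr_reclaimed - kept;
	return kept;
}

/* Other tasks racing with the reclaimer take up to nr_wanted pages. */
static unsigned long steal_pages(struct free_pool *pool,
				 unsigned long nr_wanted)
{
	unsigned long taken;

	assert(pool != NULL);
	taken = nr_wanted < pool->nr_free ? nr_wanted : pool->nr_free;
	pool->nr_free -= taken;
	return taken;
}
```

With nr_keep = 1 the reclaimer still has its page even if the other
tasks drain everything that went back to the pool.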

>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
> reclaim allocation fails") has the same target, however is obviously
> costly and less effective. It seems more clean to just remove the
> retry and drain code than to retain it.

Tend to agree.
My old patch can solve it, I think.

>
> - it's a bit hacky to reclaim more than requested pages inside
> do_try_to_free_page(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
> pages, so it stops the opportunistic reclaim when scanned 2 times pages
>
> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>
> LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>
> LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
> RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
> TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>
> LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
> RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
> TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan)
> ----------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL: 69 443 421 567 273 279 269 334 Function call interrupts

Looks amazing.
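
To put a number on it, the CAL rows quoted above can be summed per run.
This is only a quick check of the quoted counters (base kernel versus
"WMARK_HIGH, limited scan"); the helper names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the per-CPU counts from one /proc/interrupts row. */
static unsigned long sum_counts(const unsigned long *v, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(v != NULL && n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

/* CAL rows quoted in this thread (8 CPUs). */
static const unsigned long cal_base[8] = {
	220449, 220246, 220372, 220558, 220251, 219740, 220043, 219968
};
static const unsigned long cal_limited_scan[8] = {
	69, 443, 421, 567, 273, 279, 269, 334
};

/* Integer reduction factor: total before / total after. */
static unsigned long cal_reduction_factor(void)
{
	return sum_counts(cal_base, 8) / sum_counts(cal_limited_scan, 8);
}
```

The totals are 1761627 and 2655, so the function call interrupts drop by
a factor of roughly 660.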

>
> LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
> RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
> TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> ----------------------------------------------------------------
>
> start time: 3
> total time: 50
> nr_alloc_fail 162
> allocstall 45523
>
> CPU count real total virtual total delay total
> 921 3024540200 3009244668 37123129525
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 357 4891766796 13ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> -----------------------------------------------------------------
>
> start time: 272
> total time: 509
> nr_alloc_fail 3913
> allocstall 541789
>
> CPU count real total virtual total delay total
> 1044 3445476208 3437200482 229919915202
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 452 34691441605 76ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> --------------------------------------------------------------------------------
>
> start time: 278
> total time: 513
> nr_alloc_fail 4737
> allocstall 436392
>
>
> CPU count real total virtual total delay total
> 1024 3371487456 3359441487 225088210977
> IO count delay total delay average
> 1 160631171 160ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 367 30809994722 83ms
> dd: read=20480, write=0, cancelled_write=0
>
>
> no cond_resched():

What's this?

>
> start time: 263
> total time: 516
> nr_alloc_fail 5144
> allocstall 436787
>
> CPU count real total virtual total delay total
> 1018 3305497488 3283831119 241982934044
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 328 31398481378 95ms
> dd: read=0, write=0, cancelled_write=0
>
> zone_watermark_ok_safe():
>
> start time: 266
> total time: 513
> nr_alloc_fail 4526
> allocstall 440246
>
> CPU count real total virtual total delay total
> 1119 3640446568 3619184439 240945024724
> IO count delay total delay average
> 3 303620082 101ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 372 27320731898 73ms
> dd: read=77824, write=0, cancelled_write=0
>
>
> start time: 275

What's the meaning of start time?

> total time: 517

Is total time the elapsed time of your experiment?

> nr_alloc_fail 4694
> allocstall 431021
>
>
> CPU count real total virtual total delay total
> 1073 3534462680 3512544928 234056498221

What's the meaning of the CPU fields?

> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 386 34751778363 89ms
> dd: read=0, write=0, cancelled_write=0
>

Where is the vanilla data for comparing latency?
Personally, I find your data hard to parse.


> CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
> fs/buffer.c | 4 ++--
> include/linux/swap.h | 3 ++-
> mm/page_alloc.c | 20 +++++---------------
> mm/vmscan.c | 31 +++++++++++++++++++++++--------
> 4 files changed, 32 insertions(+), 26 deletions(-)
> --- linux-next.orig/mm/vmscan.c 2011-04-29 10:42:14.000000000 +0800
> +++ linux-next/mm/vmscan.c 2011-04-30 21:59:33.000000000 +0800
> @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
> * returns: 0, if no pages reclaimed
> * else, the number of pages reclaimed
> */
> -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> - struct scan_control *sc)
> +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> + struct zonelist *zonelist,
> + struct scan_control *sc)
> {
> int priority;
> unsigned long total_scanned = 0;
> @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
> struct zoneref *z;
> struct zone *zone;
> unsigned long writeback_threshold;
> + unsigned long min_reclaim = sc->nr_to_reclaim;

Hmm,

>
> get_mems_allowed();
> delayacct_freepages_start();
> @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
> if (scanning_global_lru(sc))
> count_vm_event(ALLOCSTALL);
>
> + if (preferred_zone)
> + sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> +

Hmm, I don't like this idea.
The goal of the direct reclaim path is to reclaim pages ASAP, I believe.
Most of the work should be done by kswapd in the background.
If the admin changes min_free_kbytes, it can affect the latency of direct reclaim.
That doesn't make sense to me.
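
A rough sketch of that dependency, under loudly simplified assumptions:
a single zone, 4 KiB pages, the min watermark equal to min_free_kbytes
converted to pages, and the high watermark 50% above min. The model_*
names are hypothetical; real kernels split min_free_kbytes across zones.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL	/* reclaim batch named in the patch description */

/*
 * Simplified single-zone model (assumption): min watermark is
 * min_free_kbytes in pages, high watermark sits 50% above it.
 */
static unsigned long model_high_wmark(unsigned long min_free_kbytes,
				      unsigned long page_size_kb)
{
	unsigned long min_pages;

	assert(page_size_kb > 0);
	min_pages = min_free_kbytes / page_size_kb;
	return min_pages + min_pages / 2;
}

/* Reclaim target after "sc->nr_to_reclaim += watermark[WMARK_HIGH]". */
static unsigned long model_nr_to_reclaim(unsigned long min_free_kbytes)
{
	return SWAP_CLUSTER_MAX + model_high_wmark(min_free_kbytes, 4);
}
```

In this model min_free_kbytes=4096 gives a target of 1568 pages and
min_free_kbytes=65536 gives 24608 pages, so a sysctl change moves the
amount of work each direct reclaimer has to do.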


> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> sc->nr_scanned = 0;
> if (!priority)
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> }
> }
> total_scanned += sc->nr_scanned;
> - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> - goto out;
> + if (sc->nr_reclaimed >= min_reclaim) {
> + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> + goto out;

I can't understand the logic.
If nr_reclaimed is bigger than min_reclaim, it's always greater than
nr_to_reclaim. What's the meaning of min_reclaim?


> + if (total_scanned > 2 * sc->nr_to_reclaim)
> + goto out;

What if there are lots of dirty pages in the LRU?
What if there are lots of unevictable pages in the LRU?
What if there are lots of mapped pages in the LRU but may_unmap = 0?
I mean it's a rather risky early conclusion.
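
A user-space sketch of the exit checks in the hunk, to show the case I
worry about. Assumptions: min_reclaim holds the caller's original
target, nr_to_reclaim is the raised one, and scan_per_reclaim is a
hypothetical knob standing in for dirty, unevictable or unmappable
pages. struct scan_model, should_stop() and run_model() are made-up
names, and the priority loop is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct scan_model {
	unsigned long nr_reclaimed;	/* pages freed so far */
	unsigned long total_scanned;	/* pages scanned so far */
	unsigned long nr_to_reclaim;	/* raised target */
	unsigned long min_reclaim;	/* caller's original target */
	bool wmark_ok;			/* preferred zone above high watermark? */
};

/* Mirrors the order of the checks in the quoted hunk. */
static bool should_stop(const struct scan_model *m)
{
	if (m->nr_reclaimed < m->min_reclaim)
		return false;	/* original target not met yet */
	if (m->nr_reclaimed >= m->nr_to_reclaim)
		return true;	/* raised target met */
	if (m->total_scanned > 2 * m->nr_to_reclaim)
		return true;	/* scan budget exhausted */
	return m->wmark_ok;
}

/*
 * Scan in fixed batches where only one page in scan_per_reclaim is
 * reclaimable. Returns the pages reclaimed when the loop stops.
 */
static unsigned long run_model(struct scan_model *m, unsigned long batch,
			       unsigned long scan_per_reclaim)
{
	assert(m != NULL && scan_per_reclaim > 0);
	assert(batch >= scan_per_reclaim);	/* guarantees forward progress */
	while (!should_stop(m)) {
		m->total_scanned += batch;
		m->nr_reclaimed += batch / scan_per_reclaim;
	}
	return m->nr_reclaimed;
}
```

With min_reclaim = 32, nr_to_reclaim = 1568, a batch of 32 and one
reclaimable page per 16 scanned, the model stops on the scan budget
after freeing 198 pages, far short of the raised target.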


> + if (preferred_zone &&
> + zone_watermark_ok_safe(preferred_zone, sc->order,
> + high_wmark_pages(preferred_zone),
> + zone_idx(preferred_zone), 0))
> + goto out;
> + }

As I said, I think the direct reclaim path should be as fast as possible, and
its latency should not be a function of min_free_kbytes.
Of course, there are already many obstacles to keeping direct reclaim latency
consistent, but at least I don't want to add another source.


--
Kind regards,
Minchan Kim