Re: [RFC PATCH 2/2] Don't continue reclaim if the system have plenty free memory

From: Minchan Kim
Date: Tue Jul 07 2009 - 09:20:28 EST


Hi, Kosaki.

On Tue, Jul 7, 2009 at 6:48 PM, KOSAKI
Motohiro<kosaki.motohiro@xxxxxxxxxxxxxx> wrote:
> Subject: [PATCH] Don't continue reclaim if the system have plenty free memory
>
> On concurrent reclaim situation, if one reclaimer makes OOM, maybe other
> reclaimer can stop reclaim because OOM killer makes enough free memory.
>
> But current kernel doesn't have its logic. Then, we can face following accidental
> 2nd OOM scenario.
>
> 1. System memory is used by only one big process.
> 2. memory shortage occur and concurrent reclaim start.
> 3. One reclaimer makes OOM and OOM killer kill above big process.
> 4. Almost reclaimable page will be freed.
> 5. Another reclaimer can't find any reclaimable page because those pages are
> Â already freed.
> 6. Then, system makes accidental and unnecessary 2nd OOM killer.
>

Did you see the this situation ?
Why I ask is that we have already a routine for preventing parallel
OOM killing in __alloc_pages_may_oom.

Couldn't it protect your scenario ?
If it can't, Could you explain the scenario in more detail ?

I think first we try to modify old routine with efficient.

>
> Plus, nowaday datacenter system have badboy process monitoring system and
> it kill too much memory consumption process.
> But it don't stop other reclaimer and it makes accidental 2nd OOM by the
> same reason.
>
>
> This patch have one good side effect. it increase reclaim depended benchmark
> performance.
>
> e.g.
> =====
> % ./hackbench 140 process 100
>
> before:
> Â Â Â ÂTime: 93.361
> after:
> Â Â Â ÂTime: 28.799
>
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> ---
> Âfs/buffer.c     Â|  Â2 +-
> Âinclude/linux/swap.h | Â Â3 ++-
> Âmm/page_alloc.c   Â|  Â3 ++-
> Âmm/vmscan.c     Â|  29 ++++++++++++++++++++++++++++-
> Â4 files changed, 33 insertions(+), 4 deletions(-)
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -87,6 +87,9 @@ struct scan_control {
> Â Â Â Â */
>    Ânodemask_t   Â*nodemask;
>
> + Â Â Â /* Caller's preferred zone. */
> +    struct zone   *preferred_zone;
> +
> Â Â Â Â/* Pluggable isolate pages callback */
> Â Â Â Âunsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
> Â Â Â Â Â Â Â Â Â Â Â Âunsigned long *scanned, int order, int mode,
> @@ -1535,6 +1538,10 @@ static void shrink_zone(int priority, st
> Â Â Â Âunsigned long nr_reclaimed = sc->nr_reclaimed;
> Â Â Â Âunsigned long swap_cluster_max = sc->swap_cluster_max;
> Â Â Â Âint noswap = 0;
> + Â Â Â int classzone_idx = 0;
> +
> + Â Â Â if (sc->preferred_zone)
> + Â Â Â Â Â Â Â classzone_idx = zone_idx(sc->preferred_zone);
>
> Â Â Â Â/* If we have no swap space, do not bother scanning anon pages. */
> Â Â Â Âif (!sc->may_swap || (nr_swap_pages <= 0)) {
> @@ -1583,6 +1590,20 @@ static void shrink_zone(int priority, st
> Â Â Â Â Â Â Â Âif (nr_reclaimed > swap_cluster_max &&
> Â Â Â Â Â Â Â Â Â Â Â Âpriority < DEF_PRIORITY && !current_is_kswapd())
> Â Â Â Â Â Â Â Â Â Â Â Âbreak;
> +
> + Â Â Â Â Â Â Â /*
> + Â Â Â Â Â Â Â Â* Now, we have plenty free memory.
> + Â Â Â Â Â Â Â Â* Perhaps, big processes exited or they killed by OOM killer.
> + Â Â Â Â Â Â Â Â* To continue reclaim doesn't make any sense.
> + Â Â Â Â Â Â Â Â*/
> + Â Â Â Â Â Â Â if (zone_page_state(zone, NR_FREE_PAGES) >
> + Â Â Â Â Â Â Â Â Â zone_lru_pages(zone) &&
> + Â Â Â Â Â Â Â Â Â zone_watermark_ok(zone, sc->order, high_wmark_pages(zone),
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â classzone_idx, 0)) {
> + Â Â Â Â Â Â Â Â Â Â Â /* fake result for reclaim stop */
> + Â Â Â Â Â Â Â Â Â Â Â nr_reclaimed += swap_cluster_max;
> + Â Â Â Â Â Â Â Â Â Â Â break;
> + Â Â Â Â Â Â Â }
> Â Â Â Â}
>
> Â Â Â Âsc->nr_reclaimed = nr_reclaimed;
> @@ -1767,7 +1788,8 @@ out:
> Â}
>
> Âunsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â gfp_t gfp_mask, nodemask_t *nodemask)
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â gfp_t gfp_mask, nodemask_t *nodemask,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â struct zone *preferred_zone)
> Â{
> Â Â Â Âstruct scan_control sc = {
> Â Â Â Â Â Â Â Â.gfp_mask = gfp_mask,
> @@ -1780,6 +1802,7 @@ unsigned long try_to_free_pages(struct z
> Â Â Â Â Â Â Â Â.mem_cgroup = NULL,
> Â Â Â Â Â Â Â Â.isolate_pages = isolate_pages_global,
> Â Â Â Â Â Â Â Â.nodemask = nodemask,
> + Â Â Â Â Â Â Â .preferred_zone = preferred_zone,
> Â Â Â Â};
>
> Â Â Â Âreturn do_try_to_free_pages(zonelist, &sc);
> @@ -1808,6 +1831,10 @@ unsigned long try_to_free_mem_cgroup_pag
> Â Â Â Âsc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> Â Â Â Â Â Â Â Â Â Â Â Â(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> Â Â Â Âzonelist = NODE_DATA(numa_node_id())->node_zonelists;
> + Â Â Â first_zones_zonelist(zonelist,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Âgfp_zone(sc.gfp_mask), NULL,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â&sc.preferred_zone);
> +
> Â Â Â Âreturn do_try_to_free_pages(zonelist, &sc);
> Â}
> Â#endif
> Index: b/fs/buffer.c
> ===================================================================
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -290,7 +290,7 @@ static void free_more_memory(void)
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â&zone);
> Â Â Â Â Â Â Â Âif (zone)
> Â Â Â Â Â Â Â Â Â Â Â Âtry_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â GFP_NOFS, NULL);
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â GFP_NOFS, NULL, zone);
> Â Â Â Â}
> Â}
>
> Index: b/include/linux/swap.h
> ===================================================================
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -213,7 +213,8 @@ static inline void lru_cache_add_active_
>
> Â/* linux/mm/vmscan.c */
> Âextern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â gfp_t gfp_mask, nodemask_t *mask);
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âgfp_t gfp_mask, nodemask_t *mask,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âstruct zone *preferred_zone);
> Âextern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âgfp_t gfp_mask, bool noswap,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âunsigned int swappiness);
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1629,7 +1629,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> Â Â Â Âreclaim_state.reclaimed_slab = 0;
> Â Â Â Âp->reclaim_state = &reclaim_state;
>
> - Â Â Â *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
> + Â Â Â *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Ânodemask, preferred_zone);
>
> Â Â Â Âp->reclaim_state = NULL;
> Â Â Â Âlockdep_clear_current_reclaim_state();
>
>
>



--
Kind regards,
Minchan Kim
N‹§²æìr¸›yúèšØb²X¬¶ÇvØ^–)Þ{.nÇ+‰·¥Š{±‘êçzX§¶›¡Ü}©ž²ÆzÚ&j:+v‰¨¾«‘êçzZ+€Ê+zf£¢·hšˆ§~†­†Ûiÿûàz¹®w¥¢¸?™¨è­Ú&¢)ßf”ù^jÇy§m…á@A«a¶Úÿ 0¶ìh®å’i