Re: [PATCH 0/3] OOM detection rework v4

From: Tetsuo Handa
Date: Sat Jan 02 2016 - 10:48:31 EST


Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > I got OOM killers while running heavy disk I/O (extracting a kernel
> > > > source tree, running lxr's genxref command). (Environment: 4 CPUs /
> > > > 2048MB RAM / no swap / XFS.) Do you think these OOM killers are
> > > > reasonable? Is the logic too weak against fragmentation?
> > >
> > > Well, the current patch invokes the OOM killer when more than 75% of memory
> > > is used for file cache (active_file: + inactive_file:). I think this will
> > > surprise administrators, and we want to retry harder (but not forever,
> > > please).
> >
> > Here again, it would be good to see a comparison between the original and
> > the new behavior. 75% of memory in page cache is certainly unexpected, but
> > those pages might be pinned for other reasons, and so unreclaimable and
> > basically IO bound. This is hard to optimize for without causing undesirable
> > side effects for other loads. I will have a look at the OOM reports later,
> > but having a comparison would be a great start.
>
> Prior to the "mm, oom: rework oom detection" patch (the original), this
> stressor never invoked the OOM killer. After this patch (the new), this
> stressor easily invokes the OOM killer. In both the original and the new
> case, active_file: + inactive_file: occupies nearly 75%. I think we lost
> the invisible retry logic for order > 0 allocation requests.
>

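By "invisible retry logic" I mean the unconditional retry for low-order
requests in the old should_alloc_retry(). A minimal sketch, abridged and
paraphrased from the pre-rework mm/page_alloc.c (not the exact code):

----------
static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned order,
		   unsigned long did_some_progress,
		   unsigned long pages_reclaimed)
{
	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/* Always retry if specifically requested */
	if (gfp_mask & __GFP_NOFAIL)
		return 1;

	/*
	 * Requests up to PAGE_ALLOC_COSTLY_ORDER were treated as
	 * implicitly __GFP_NOFAIL: as long as direct reclaim reported
	 * any progress, the allocation was retried and the OOM killer
	 * was never consulted.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Larger orders retried only if __GFP_REPEAT was set and we had
	 * not yet reclaimed an order's worth of pages.
	 */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}
----------

The rework replaces this with should_reclaim_retry(), which consults the zone
watermarks instead, and that is where the behavior changes for order-2
requests like the fork() in the log below.
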
I retested with the debug printk() patch below. It changes did_some_progress
from bool to unsigned long so that the actual number of reclaimed pages can be
printed, and it logs each zone that fails the watermark check in
should_reclaim_retry().

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d70a80..e433504 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3014,7 +3014,7 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
- bool did_some_progress,
+ unsigned long did_some_progress,
int no_progress_loops)
{
struct zone *zone;
@@ -3024,8 +3024,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
* Make sure we converge to OOM if we cannot make any progress
* several times in the row.
*/
- if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ if (no_progress_loops > MAX_RECLAIM_RETRIES) {
+ printk(KERN_INFO "Reached MAX_RECLAIM_RETRIES.\n");
return false;
+ }

/* Do not retry high order allocations unless they are __GFP_REPEAT */
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
@@ -3086,6 +3088,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,

return true;
}
+ printk(KERN_INFO "zone=%s reclaimable=%lu available=%lu no_progress_loops=%u did_some_progress=%lu\n",
+ zone->name, reclaimable, available, no_progress_loops, did_some_progress);
}

return false;
@@ -3273,7 +3277,7 @@ retry:
no_progress_loops++;

if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, no_progress_loops))
+ did_some_progress, no_progress_loops))
goto retry;

/* Reclaim has failed us, start killing things */
----------

The output showed that __zone_watermark_ok() returning false for both the
DMA32 and DMA zones is what triggers the OOM killer invocation. Direct reclaim
is constantly reclaiming some pages, but I guess the free lists for
2 <= order < MAX_ORDER are empty. That check was introduced by commit
97a16fc82a7c5b0c ("mm, page_alloc: only enforce watermarks for order-0
allocations"), and the "mm, oom: rework oom detection" patch now runs into it.
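
For reference, this is the high-order part of __zone_watermark_ok() that fails
here, as introduced by that commit. A fragment, abridged and paraphrased (z,
order, and alloc_harder come from the enclosing function):

----------
	/* If this is an order-0 request then the watermark is fine */
	if (!order)
		return true;

	/* For a high-order request, check at least one suitable page is free */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];
		int mt;

		if (!area->nr_free)
			continue;

		if (alloc_harder)
			return true;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
			if (!list_empty(&area->free_list[mt]))
				return true;
		}
	}
	return false;
----------

In the log below, "available" is far above the min watermark, so the order-0
part of the check passes; the function can only be returning false because
this loop finds no free block of order >= 2 on a suitable free list. Note that
the OOM killer is invoked with no_progress_loops=0, i.e. while reclaim is
still making progress.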

----------
[ 154.547143] zone=DMA32 reclaimable=323478 available=325894 no_progress_loops=0 did_some_progress=58
[ 154.551119] zone=DMA32 reclaimable=323153 available=325770 no_progress_loops=0 did_some_progress=58
[ 154.571983] zone=DMA32 reclaimable=319582 available=322161 no_progress_loops=0 did_some_progress=56
[ 154.576121] zone=DMA32 reclaimable=319647 available=322016 no_progress_loops=0 did_some_progress=56
[ 154.583523] zone=DMA32 reclaimable=319467 available=321801 no_progress_loops=0 did_some_progress=55
[ 154.593948] zone=DMA32 reclaimable=317400 available=320988 no_progress_loops=0 did_some_progress=56
[ 154.730880] zone=DMA32 reclaimable=312385 available=313952 no_progress_loops=0 did_some_progress=48
[ 154.733226] zone=DMA32 reclaimable=312337 available=313919 no_progress_loops=0 did_some_progress=48
[ 154.737270] zone=DMA32 reclaimable=312417 available=313871 no_progress_loops=0 did_some_progress=48
[ 154.739569] zone=DMA32 reclaimable=312369 available=313844 no_progress_loops=0 did_some_progress=48
[ 154.743195] zone=DMA32 reclaimable=312385 available=313790 no_progress_loops=0 did_some_progress=48
[ 154.745534] zone=DMA32 reclaimable=312365 available=313813 no_progress_loops=0 did_some_progress=48
[ 154.748431] zone=DMA32 reclaimable=312272 available=313728 no_progress_loops=0 did_some_progress=48
[ 154.750973] zone=DMA32 reclaimable=312273 available=313760 no_progress_loops=0 did_some_progress=48
[ 154.753503] zone=DMA32 reclaimable=312289 available=313958 no_progress_loops=0 did_some_progress=48
[ 154.753584] zone=DMA32 reclaimable=312241 available=313958 no_progress_loops=0 did_some_progress=48
[ 154.753660] zone=DMA32 reclaimable=312193 available=313958 no_progress_loops=0 did_some_progress=48
[ 154.781574] zone=DMA32 reclaimable=312147 available=314095 no_progress_loops=0 did_some_progress=48
[ 154.784281] zone=DMA32 reclaimable=311539 available=314015 no_progress_loops=0 did_some_progress=49
[ 154.786639] zone=DMA32 reclaimable=311498 available=314040 no_progress_loops=0 did_some_progress=49
[ 154.788761] zone=DMA32 reclaimable=311432 available=314040 no_progress_loops=0 did_some_progress=49
[ 154.791047] zone=DMA32 reclaimable=311366 available=314040 no_progress_loops=0 did_some_progress=49
[ 154.793388] zone=DMA32 reclaimable=311300 available=314040 no_progress_loops=0 did_some_progress=49
[ 154.795802] zone=DMA32 reclaimable=311153 available=314006 no_progress_loops=0 did_some_progress=49
[ 154.804685] zone=DMA32 reclaimable=309950 available=313140 no_progress_loops=0 did_some_progress=49
[ 154.807039] zone=DMA32 reclaimable=309867 available=313138 no_progress_loops=0 did_some_progress=49
[ 154.809440] zone=DMA32 reclaimable=309761 available=313080 no_progress_loops=0 did_some_progress=49
[ 154.811583] zone=DMA32 reclaimable=309735 available=313120 no_progress_loops=0 did_some_progress=49
[ 154.814090] zone=DMA32 reclaimable=309561 available=313068 no_progress_loops=0 did_some_progress=49
[ 154.817381] zone=DMA32 reclaimable=309463 available=313030 no_progress_loops=0 did_some_progress=49
[ 154.824387] zone=DMA32 reclaimable=309414 available=313030 no_progress_loops=0 did_some_progress=49
[ 154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
[ 154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
[ 154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 154.841167] fork cpuset=/ mems_allowed=0
[ 154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G W 4.4.0-rc7-next-20151231+ #273
[ 154.844308] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 154.846654] 0000000000000000 0000000045061c6b ffff88007a5dbb00 ffffffff81398b83
[ 154.848559] 0000000000000000 ffff88007a5dbba0 ffffffff811bc81c 0000000000000206
[ 154.850488] ffffffff818104b0 ffff88007a5dbb40 ffffffff810bdd79 0000000000000206
[ 154.852386] Call Trace:
[ 154.853350] [<ffffffff81398b83>] dump_stack+0x4b/0x68
[ 154.854731] [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
[ 154.856309] [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 154.858046] [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
[ 154.859593] [<ffffffff81143d36>] oom_kill_process+0x366/0x540
[ 154.861142] [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
[ 154.862655] [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
[ 154.864194] [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
[ 154.865852] [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
[ 154.867844] [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
[ 154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
[ 154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
[ 154.875357] [<ffffffff8106d441>] copy_process.part.31+0x131/0x1b40
[ 154.877845] [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[ 154.880397] [<ffffffff8106f01b>] _do_fork+0xdb/0x5d0
[ 154.882259] [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[ 154.884722] [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 154.887201] [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 154.889666] [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 154.891519] [<ffffffff8106f594>] SyS_clone+0x14/0x20
[ 154.893059] [<ffffffff816feeb2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 154.894859] Mem-Info:
[ 154.895851] active_anon:31807 inactive_anon:2093 isolated_anon:0
[ 154.895851] active_file:242656 inactive_file:67266 isolated_file:0
[ 154.895851] unevictable:0 dirty:8 writeback:0 unstable:0
[ 154.895851] slab_reclaimable:15100 slab_unreclaimable:20839
[ 154.895851] mapped:1681 shmem:2162 pagetables:18491 bounce:0
[ 154.895851] free:4243 free_pcp:343 free_cma:0
[ 154.905459] Node 0 DMA free:6908kB min:44kB low:52kB high:64kB active_anon:3408kB inactive_anon:120kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:64kB shmem:124kB slab_reclaimable:872kB slab_unreclaimable:3032kB kernel_stack:176kB pagetables:328kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 154.916097] lowmem_reserve[]: 0 1714 1714 1714
[ 154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB active_anon:121688kB inactive_anon:8252kB active_file:970620kB inactive_file:269060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758944kB mlocked:0kB dirty:32kB writeback:0kB mapped:6660kB shmem:8524kB slab_reclaimable:59528kB slab_unreclaimable:80460kB kernel_stack:47312kB pagetables:70972kB unstable:0kB bounce:0kB free_pcp:1356kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 154.929908] lowmem_reserve[]: 0 0 0 0
[ 154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
[ 154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
[ 154.941617] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 154.944167] 312171 total pagecache pages
[ 154.945926] 0 pages in swap cache
[ 154.947521] Swap cache stats: add 0, delete 0, find 0/0
[ 154.949436] Free swap = 0kB
[ 154.950920] Total swap = 0kB
[ 154.952531] 524157 pages RAM
[ 154.954063] 0 pages HighMem/MovableOnly
[ 154.955785] 80445 pages reserved
[ 154.957362] 0 pages hwpoisoned
----------