Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
From: Tetsuo Handa
Date: Tue Jan 03 2017 - 09:39:32 EST
Michal Hocko wrote:
> On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> [...]
> > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > __GFP_NORETRY with this patch in the changelog.
>
> Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> reason to describe all the nonsense combinations of gfp flags.
Before [PATCH 1/3]:
__GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
request even if __GFP_NOFAIL is specified if direct reclaim/compaction
did not help."
__GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
is specified even if direct reclaim/compaction did not help."
After [PATCH 1/3]:
__GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
request unless __GFP_NOFAIL is specified."
__GFP_NOFAIL is used as "Never fail allocation request even if direct
reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
specified."
Thus, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense as
"Never fail allocation request if direct reclaim/compaction did not help.
But do not invoke the OOM killer even if direct reclaim/compaction did not help."
>
> > But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" is correct. Firstly, we need to confirm
> >
> > "The pre-mature OOM killer is a real issue as reported by Nils Holland"
> >
> > in the changelog is still true because we haven't tested with "[PATCH] mm, memcg:
> > fix the active list aging for lowmem requests when memcg is enabled" applied and
> > without "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" and "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not
> > trigger OOM killer" applied.
>
> Yes I have dropped the reference to this report already in my local
> patch because in this particular case the issue was somewhere else
> indeed!
OK.
>
> > Secondly, as you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
> > helpers" as a mean to enforce not to invoke the OOM killer
> >
> > /*
> > * Make sure that larger requests are not too disruptive - no OOM
> > * killer and no allocation failure warnings as we have a fallback
> > */
> > if (size > PAGE_SIZE)
> > kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> >
> > , we can use __GFP_NORETRY as a mean to enforce not to invoke the OOM killer
> > rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
> > __GFP_NOFAIL automatically".
> >
As I wrote above, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense.
> > Additionally, although currently there seems to be no
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) users, kvmalloc_node() in
> > "[PATCH] mm: introduce kv[mz]alloc helpers" will be confused when a
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user comes in in the future because
> > "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" makes
> > __GFP_NOFAIL stronger than __GFP_NORETRY.
>
> Using NOFAIL in kv[mz]alloc simply makes no sense at all. The vmalloc
> fallback would be simply unreachable!
My intention is shown below.
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
gfp_t kmalloc_flags = flags;
void *ret;
/*
* vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
* so the given set of flags has to be compatible.
*/
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
/*
* Make sure that larger requests are not too disruptive - no OOM
* killer and no allocation failure warnings as we have a fallback
*/
- if (size > PAGE_SIZE)
+ if (size > PAGE_SIZE) {
kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+ kmalloc_flags &= ~__GFP_NOFAIL;
+ }
ret = kmalloc_node(size, kmalloc_flags, node);
/*
* It doesn't really make sense to fallback to vmalloc for sub page
* requests
*/
if (ret || size <= PAGE_SIZE)
return ret;
return __vmalloc_node_flags(size, node, flags);
}
>
> > My concern with "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which
> > do not trigger OOM killer" is
> >
> > "AFAIU, this is an allocation path which doesn't block a forward progress
> > on a regular IO. It is merely a check whether there is a new medium in
> > the CDROM (aka regular polling of the device). I really fail to see any
> > reason why this one should get any access to memory reserves at all."
> >
> > in http://lkml.kernel.org/r/20161218163727.GC8440@xxxxxxxxxxxxxx .
> > Indeed that trace is a __GFP_DIRECT_RECLAIM and it might not be blocking
> > other workqueue items which a regular I/O depend on, I think there are
> > !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> > which could potentially start failing due to helping GFP_NOFS | __GFP_NOFAIL
> > allocations with memory reserves. If a SCSI disk I/O request fails due to
> > GFP_ATOMIC memory allocation failures because we allow a FS I/O request to
> > use memory reserves, it adds a new problem.
>
> Do you have any example of such a request? Anything that requires
> a forward progress during IO should be using mempools otherwise it
> is broken pretty much by design already. Also IO depending on NOFS
> allocations sounds pretty much broken already. So I suspect the above
> reasoning is just bogus.
You are missing my point. My question is "who needs memory reserves".
I'm not saying that disk I/O depends on GFP_NOFS allocation. I'm worrying
that [PATCH 3/3] consumes memory reserves when disk I/O also depends on
memory reserves.
My understanding is that when accessing SCSI disks, SCSI protocol is used.
SCSI driver allocates memory at runtime for using SCSI protocol using
GFP_ATOMIC. And GFP_ATOMIC uses some of memory reserves. But [PATCH 3/3]
also uses memory reserves. If memory reserves are consumed by [PATCH 3/3]
to the level where GFP_ATOMIC cannot succeed, I think it causes troubles.
I'm unable to obtain nice backtraces, but I think we can confirm that
there are GFP_ATOMIC allocations (e.g. sg_alloc_table_chained() calls
__sg_alloc_table(GFP_ATOMIC)) when we are using SCSI disks.
stap -DSTAP_NO_OVERLOAD=1 -DMAXSKIPPED=10000000 -e 'global in_scope; global traces;
probe begin { println("Ready."); }
probe kernel.function("scsi_*").call { in_scope[tid()]++; }
probe kernel.function("scsi_*").return { in_scope[tid()]--; }
probe kernel.function("__alloc_pages_nodemask") {
if (in_scope[tid()] != 0) {
bt = backtrace();
if (!([bt,$gfp_mask] in traces)) {
traces[bt,$gfp_mask] = 1;
printf("mode=0x%x,order=%u by %s(%u)\n",
$gfp_mask, $order, execname(), pid());
print_backtrace();
println();
}
}
}'
[ 183.887841] ------------[ cut here ]------------
[ 183.888847] WARNING: CPU: 0 PID: 0 at /root/systemtap.tmp/share/systemtap/runtime/linux/addr-map.c:42 lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.892011] Modules linked in: stap_40159b816dfa3e3814a532dc982f551b_1_4182(O) nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper vmw_vsock_vmci_transport ppdev vmw_balloon vsock pcspkr sg parport_pc parport vmw_vmci i2c_piix4 shpchp ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel
[ 183.905779] serio_raw mptspi scsi_transport_spi vmwgfx mptscsih ahci drm_kms_helper libahci syscopyarea sysfillrect mptbase sysimgblt fb_sys_fops ttm ata_piix drm e1000 i2c_core libata
[ 183.909128] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.10.0-rc2-next-20170103 #486
[ 183.910839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 183.912815] Call Trace:
[ 183.913326] <IRQ>
[ 183.913729] dump_stack+0x85/0xc9
[ 183.914366] __warn+0xd1/0xf0
[ 183.914930] warn_slowpath_null+0x1d/0x20
[ 183.915691] lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.917539] unwind_frame.constprop.79+0xbed/0x1270 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.919238] _stp_stack_kernel_get+0x16e/0x4d0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.920870] ? gfp_pfmemalloc_allowed+0x80/0x80
[ 183.921737] _stp_stack_kernel_print+0x3e/0xc0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.923583] probe_3838+0x1f6/0x8c0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.925005] ? __alloc_pages_nodemask+0x1/0x4e0
[ 183.925875] ? __alloc_pages_nodemask+0x1/0x4e0
[ 183.926724] ? __alloc_pages_nodemask+0x1/0x4e0
[ 183.927598] enter_kprobe_probe+0x1e5/0x310 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.929164] kprobe_ftrace_handler+0xe9/0x150
[ 183.930260] ? __alloc_pages_nodemask+0x5/0x4e0
[ 183.931123] ftrace_ops_assist_func+0xbf/0x110
[ 183.932012] ? sched_clock+0x9/0x10
[ 183.932815] ? sched_clock_cpu+0x84/0xb0
[ 183.933550] 0xffffffffa00430d5
[ 183.934947] ? gfp_pfmemalloc_allowed+0x80/0x80
[ 183.936581] __alloc_pages_nodemask+0x5/0x4e0
[ 183.938409] alloc_pages_current+0x97/0x1b0
[ 183.939971] ? __alloc_pages_nodemask+0x5/0x4e0
[ 183.941567] ? alloc_pages_current+0x97/0x1b0
[ 183.943173] new_slab+0x3a8/0x6a0
[ 183.944702] ___slab_alloc+0x3a3/0x620
[ 183.946157] ? stp_lock_probe+0x4a/0xb0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[ 183.948375] ? mempool_alloc_slab+0x1c/0x20
[ 183.950135] ? mempool_alloc_slab+0x1c/0x20
[ 183.951957] __slab_alloc+0x46/0x7d
[ 183.953415] ? mempool_alloc_slab+0x1c/0x20
[ 183.955001] kmem_cache_alloc+0x2e8/0x370
[ 183.956552] ? aggr_pre_handler+0x3f/0x80
[ 183.958428] mempool_alloc_slab+0x1c/0x20
[ 183.960042] mempool_alloc+0x79/0x1b0
[ 183.961512] ? kprobe_ftrace_handler+0xa3/0x150
[ 183.963082] ? scsi_init_io+0x5/0x1d0
[ 183.964501] sg_pool_alloc+0x45/0x50
[ 183.966064] __sg_alloc_table+0xdf/0x150
[ 183.967402] ? sg_free_table_chained+0x30/0x30
[ 183.969030] sg_alloc_table_chained+0x3f/0x90
[ 183.970405] scsi_init_sgtable+0x31/0x70
[ 183.971783] copy_oldmem_page+0xd0/0xd0
[ 183.973506] copy_oldmem_page+0xd0/0xd0
[ 183.974802] sd_init_command+0x3c/0xb0 [sd_mod]
[ 183.976187] scsi_setup_cmnd+0xf0/0x150
[ 183.977617] copy_oldmem_page+0xd0/0xd0
[ 183.978849] ? scsi_prep_fn+0x5/0x170
[ 183.980119] copy_oldmem_page+0xd0/0xd0
[ 183.981406] scsi_request_fn+0x42/0x740
[ 183.982760] ? scsi_request_fn+0x5/0x740
[ 183.984049] copy_oldmem_page+0xd0/0xd0
[ 183.985438] blk_run_queue+0x26/0x40
[ 183.986751] scsi_kick_queue+0x25/0x30
[ 183.987891] copy_oldmem_page+0xd0/0xd0
[ 183.989073] copy_oldmem_page+0xd0/0xd0
[ 183.990175] copy_oldmem_page+0xd0/0xd0
[ 183.991348] copy_oldmem_page+0xd0/0xd0
[ 183.992450] copy_oldmem_page+0xd0/0xd0
[ 183.993677] blk_done_softirq+0xa8/0xd0
[ 183.994766] __do_softirq+0xc0/0x52d
[ 183.995792] irq_exit+0xf5/0x110
[ 183.996748] smp_call_function_single_interrupt+0x33/0x40
[ 183.998230] call_function_single_interrupt+0x9d/0xb0
[ 183.999521] RIP: 0010:native_safe_halt+0x6/0x10
[ 184.000791] RSP: 0018:ffffffff81c03dd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[ 184.002749] RAX: ffffffff81c18500 RBX: 0000000000000000 RCX: 0000000000000000
[ 184.004440] RDX: ffffffff81c18500 RSI: 0000000000000001 RDI: ffffffff81c18500
[ 184.006281] RBP: ffffffff81c03dd0 R08: 0000000000000000 R09: 0000000000000000
[ 184.008063] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 184.009842] R13: ffffffff81c18500 R14: ffffffff81c18500 R15: 0000000000000000
[ 184.011758] </IRQ>
[ 184.012952] default_idle+0x23/0x1d0
[ 184.014085] arch_cpu_idle+0xf/0x20
[ 184.015204] default_idle_call+0x23/0x40
[ 184.016347] do_idle+0x162/0x230
[ 184.017353] cpu_startup_entry+0x71/0x80
[ 184.018686] rest_init+0x138/0x140
[ 184.019961] ? rest_init+0x5/0x140
[ 184.021034] start_kernel+0x4ba/0x4db
[ 184.022305] ? set_init_arg+0x55/0x55
[ 184.023405] ? early_idt_handler_array+0x120/0x120
[ 184.024730] x86_64_start_reservations+0x2a/0x2c
[ 184.026121] x86_64_start_kernel+0x14c/0x16f
[ 184.027329] start_cpu+0x14/0x14
[ 184.028391] ---[ end trace cbfebb3ae93a99b9 ]---
>
> That being said, to summarize your arguments again. 1) you do not like
> that a combination of __GFP_NORETRY | __GFP_NOFAIL is not documented
> to never fail,
Correct.
> 2) based on that you argue that kv[mvz]alloc with
> __GFP_NOFAIL will never reach vmalloc
Wrong.
> and 3) that there might be some IO
> paths depending on NOFS|NOFAIL allocation which would have harder time
> to make forward progress.
Wrong.
>
> I would call 1 and 2 just bogus and 3 highly dubious at best. Do not
> get me wrong but this is not what I call a useful review feedback yet
> alone a reason to block these patches. If there are any reasons to not
> merge them these are not those.