Re: mlx5: ifup failure due to huge allocation
From: Saeed Mahameed
Date: Wed Nov 02 2016 - 19:30:30 EST
On Wed, Nov 2, 2016 at 3:37 PM, Sebastian Ott <sebott@xxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> Ifup on an interface provided by CX4 (MLX5 driver) on s390 fails with:
>
> [ 22.318553] ------------[ cut here ]------------
> [ 22.318564] WARNING: CPU: 1 PID: 399 at mm/page_alloc.c:3421 __alloc_pages_nodemask+0x2ee/0x1298
> [ 22.318568] Modules linked in: mlx4_ib ib_core mlx5_core mlx4_en mlx4_core [...]
> [ 22.318610] CPU: 1 PID: 399 Comm: NetworkManager Not tainted 4.8.0 #13
> [ 22.318614] Hardware name: IBM 2964 N96 704 (LPAR)
> [ 22.318618] task: 00000000dbe1c008 task.stack: 00000000dd9e4000
> [ 22.318622] Krnl PSW : 0704c00180000000 00000000002a427e (__alloc_pages_nodemask+0x2ee/0x1298)
> [ 22.318631] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Krnl GPRS: 0000000000000000 0000000000ceb4d4 00000000024080c0 0000000000000001
> [ 22.318640] 00000000002a4204 00000000ffffa410 00000000001fffff 0000000000000001
> [ 22.318644] 00000000024080c0 0000000000000009 0000000000000000 0000000000000000
> [ 22.318648] 00000000ffffa400 000000000088ea30 00000000002a4204 00000000dd9e7060
> [ 22.318660] Krnl Code: 00000000002a4272: a7740592 brc 7,2a4d96
> 00000000002a4276: 92011000 mvi 0(%r1),1
> #00000000002a427a: a7f40001 brc 15,2a427c
> >00000000002a427e: a7f4058c brc 15,2a4d96
> 00000000002a4282: 5830f0b4 l %r3,180(%r15)
> 00000000002a4286: 5030f0ec st %r3,236(%r15)
> 00000000002a428a: 1823 lr %r2,%r3
> 00000000002a428c: a53e0048 llilh %r3,72
> [ 22.318695] Call Trace:
> [ 22.318700] ([<00000000002a4204>] __alloc_pages_nodemask+0x274/0x1298)
> [ 22.318706] ([<000000000030dac0>] alloc_pages_current+0x1c0/0x268)
> [ 22.318712] ([<0000000000135aa6>] s390_dma_alloc+0x6e/0x1e0)
> [ 22.318733] ([<000003ff8015474c>] mlx5_dma_zalloc_coherent_node+0xb4/0xf8 [mlx5_core])
> [ 22.318748] ([<000003ff80154c58>] mlx5_buf_alloc_node+0x70/0x108 [mlx5_core])
> [ 22.318765] ([<000003ff8015fe06>] mlx5_cqwq_create+0xf6/0x180 [mlx5_core])
> [ 22.318783] ([<000003ff8016654c>] mlx5e_open_cq+0xac/0x1e0 [mlx5_core])
> [ 22.318802] ([<000003ff801693e6>] mlx5e_open_channels+0xe66/0xeb8 [mlx5_core])
> [ 22.318820] ([<000003ff8016982e>] mlx5e_open_locked+0x8e/0x1e0 [mlx5_core])
> [ 22.318837] ([<000003ff801699c6>] mlx5e_open+0x46/0x68 [mlx5_core])
> [ 22.318844] ([<0000000000748338>] __dev_open+0xa8/0x118)
> [ 22.318848] ([<000000000074867a>] __dev_change_flags+0xc2/0x190)
> [ 22.318853] ([<000000000074877e>] dev_change_flags+0x36/0x78)
> [ 22.318858] ([<000000000075bc8a>] do_setlink+0x332/0xb30)
> [ 22.318862] ([<000000000075de3a>] rtnl_newlink+0x3e2/0x820)
> [ 22.318867] ([<000000000075e46e>] rtnetlink_rcv_msg+0x1f6/0x248)
> [ 22.318873] ([<0000000000782202>] netlink_rcv_skb+0x92/0x108)
> [ 22.318878] ([<000000000075c668>] rtnetlink_rcv+0x48/0x58)
> [ 22.318882] ([<0000000000781ace>] netlink_unicast+0x14e/0x1f0)
> [ 22.318887] ([<0000000000781f82>] netlink_sendmsg+0x32a/0x3b0)
> [ 22.318892] ([<000000000071d502>] sock_sendmsg+0x5a/0x80)
> [ 22.318897] ([<000000000071ed38>] ___sys_sendmsg+0x270/0x2a8)
> [ 22.318901] ([<000000000071fe80>] __sys_sendmsg+0x60/0x90)
> [ 22.318905] ([<00000000007207c6>] SyS_socketcall+0x2be/0x388)
> [ 22.318912] ([<000000000086fcae>] system_call+0xd6/0x270)
> [ 22.318916] 3 locks held by NetworkManager/399:
> [ 22.318920] #0: (rtnl_mutex){+.+.+.}, at: [<000000000075c658>] rtnetlink_rcv+0x38/0x58
> [ 22.318935] #1: (&priv->state_lock){+.+.+.}, at: [<000003ff801699bc>] mlx5e_open+0x3c/0x68 [mlx5_core]
> [ 22.318962] #2: (&priv->alloc_mutex){+.+.+.}, at: [<000003ff801546e0>] mlx5_dma_zalloc_coherent_node+0x48/0xf8 [mlx5_core]
> [ 22.318987] Last Breaking-Event-Address:
> [ 22.318992] [<00000000002a427a>] __alloc_pages_nodemask+0x2ea/0x1298
> [ 22.318996] ---[ end trace d2b54f5a0cd00b89 ]---
> [ 22.319001] mlx5_core 0001:00:00.0: 0001:00:00.0:mlx5_cqwq_create:121:(pid 399): mlx5_buf_alloc_node() failed, -12
> [ 22.320548] mlx5_core 0001:00:00.0 enP1s171: mlx5e_open_locked: mlx5e_open_channels failed, -12
>
>
>
> This fails because the largest possible allocation on s390 is currently 1MB (order 8).
> Would it be possible to add the __GFP_NOWARN flag and fall back to a smaller
> allocation if the big one fails? (The latter change would also make the device
> usable when it is added via hotplug and free memory is already fragmented.)
>
Thanks Sebastian for the detailed report.
We are planning, and currently working on, a solution that allocates fragmented
buffers rather than demanding one large physically contiguous buffer.
Hopefully we will have it upstream before 4.10 is released.
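For reference, a very rough sketch of the direction (hypothetical names and
structures, not the actual mlx5 patch): build the queue buffer out of
page-sized, independently mapped DMA fragments, so no single allocation ever
has to be larger than one page:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Hypothetical fragment descriptor; the real mlx5 structures will differ. */
struct frag_buf {
        int             nfrags;
        void            **frags;
        dma_addr_t      *dma;
};

static int frag_buf_alloc(struct device *dev, struct frag_buf *buf, size_t size)
{
        int i = 0;

        buf->nfrags = DIV_ROUND_UP(size, PAGE_SIZE);
        buf->frags = kcalloc(buf->nfrags, sizeof(*buf->frags), GFP_KERNEL);
        buf->dma = kcalloc(buf->nfrags, sizeof(*buf->dma), GFP_KERNEL);
        if (!buf->frags || !buf->dma)
                goto err;

        /* One PAGE_SIZE coherent allocation per fragment, never a huge one. */
        for (i = 0; i < buf->nfrags; i++) {
                buf->frags[i] = dma_zalloc_coherent(dev, PAGE_SIZE,
                                                    &buf->dma[i], GFP_KERNEL);
                if (!buf->frags[i])
                        goto err;
        }
        return 0;

err:
        while (--i >= 0)
                dma_free_coherent(dev, PAGE_SIZE, buf->frags[i], buf->dma[i]);
        kfree(buf->frags);
        kfree(buf->dma);
        return -ENOMEM;
}

Only the allocation is sketched here; the device would then be given the list
of per-fragment DMA addresses instead of a single base address, and the queue
accessors would need to handle fragment boundaries, which is the real work in
such a series.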
And yes, __GFP_NOWARN is reasonable; we will add that as well. The return
value of mlx5_buf_alloc_node() is sufficient to report the error, so the
stack trace is just noise.
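A minimal illustration of that part (again only a sketch, with a hypothetical
helper name, not the final patch): keep the contiguous attempt, but pass
__GFP_NOWARN down so a failure is reported only through the return value:

#include <linux/dma-mapping.h>

/*
 * Hypothetical helper: attempt the large contiguous buffer quietly and let
 * the caller act on -ENOMEM (for example by falling back to fragments).
 */
static void *contig_alloc_quiet(struct device *dev, size_t size,
                                dma_addr_t *dma_handle)
{
        return dma_zalloc_coherent(dev, size, dma_handle,
                                   GFP_KERNEL | __GFP_NOWARN);
}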
-Saeed.