[bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)

From: Kirill Tkhai
Date: Thu Apr 19 2018 - 08:50:48 EST


Hi, Al,

commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first one after which the test below crashes the kernel:

Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date: Sat Jan 10 19:01:08 2015 -0500

switch the IO-triggering parts of umount to fs_pin

Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>

$ modprobe dummy

$ while true
do
    mkdir /var/run/netns
    touch /var/run/netns/init_net
    mount --bind /proc/1/ns/net /var/run/netns/init_net

    ip netns add foo
    ip netns exec foo ip link add dummy0 type dummy
    ip netns delete foo
done
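
A note on the loop above: it bind-mounts /proc/1/ns/net onto the same path on every iteration and never unmounts it, so the mounts presumably pile up on that mount point. A minimal sketch for checking this from a second shell while the test runs; count_mounts is a hypothetical helper name (not part of any tool), and it relies only on the /proc/self/mountinfo layout, where field 5 is the mount point:

```shell
# count_mounts MOUNTPOINT [MOUNTINFO_FILE]
# Hypothetical helper: print how many mountinfo entries have MOUNTPOINT
# as their mount point (field 5 of each line).
# MOUNTINFO_FILE defaults to /proc/self/mountinfo.
count_mounts() {
    awk -v mp="$1" '$5 == mp { n++ } END { print n + 0 }' \
        "${2:-/proc/self/mountinfo}"
}

# Example (assumption: /var/run is a symlink to /run on this system,
# so the kernel reports the path under /run):
# count_mounts /run/netns/init_net
```

A count that keeps growing would be consistent with the repeating pin_kill/mnt_pin_kill/cleanup_mnt frames in Alexander's trace quoted below, though that connection is only a guess at this point.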

[ 22.058349] ip (3249) used greatest stack depth: 8 bytes left
[ 22.182195] BUG: unable to handle kernel paging request at 000000035bb1f080
[ 22.183065] IP: [<ffffffff810718e4>] kick_process+0x34/0x80
[ 22.183065] PGD 0
[ 22.183065] Thread overran stack, or stack corrupted
[ 22.183065] Oops: 0000 [#1] PREEMPT SMP
[ 22.183065] CPU: 1 PID: 3255 Comm: ip Not tainted 3.19.0-rc5+ #111
[ 22.183065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[ 22.183065] task: ffff88007c475100 ti: ffff88007b3cc000 task.ti: ffff88007b3cc000
[ 22.183065] RIP: 0010:[<ffffffff810718e4>] [<ffffffff810718e4>] kick_process+0x34/0x80
[ 22.183065] RSP: 0018:ffff88007b3cfcf8 EFLAGS: 00010293
[ 22.183065] RAX: 0000000000012900 RBX: ffff88007c475100 RCX: ffff88007b20e7b8
[ 22.183065] RDX: 000000007b3cc028 RSI: ffffffff819b05f8 RDI: ffffffff819cb999
[ 22.183065] RBP: ffff88007b3cfd08 R08: ffffffff81cbf688 R09: ffff88007d3d0810
[ 22.183065] R10: ffff88007fc933c8 R11: 0000000000000000 R12: 000000007b3cc028
[ 22.183065] R13: ffff88007c475100 R14: 0000000000000000 R15: 00007fff7793a448
[ 22.183065] FS: 00007fc987546700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[ 22.183065] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 22.183065] CR2: 000000035bb1f080 CR3: 0000000001c11000 CR4: 00000000000006e0
[ 22.183065] Stack:
[ 22.183065] ffff88007c3b67b8 ffff88007b3cfd98 ffff88007b3cfd18 ffffffff81066b05
[ 22.183065] ffff88007b3cfd38 ffffffff81176f4c ffff88007b3cfd48 ffff88007c3b68a0
[ 22.183065] ffff88007b3cfd48 ffffffff8117777f ffff88007b3cfd68 ffffffff81177a49
[ 22.183065] Call Trace:
[ 22.183065] [<ffffffff81066b05>] task_work_add+0x45/0x60
[ 22.183065] [<ffffffff81176f4c>] mntput_no_expire+0xdc/0x150
[ 22.183065] [<ffffffff8117777f>] mntput+0x1f/0x30
[ 22.183065] [<ffffffff81177a49>] drop_mountpoint+0x29/0x30
[ 22.183065] [<ffffffff81188df6>] pin_kill+0x66/0xf0
[ 22.183065] [<ffffffff81082c60>] ? __wake_up_common+0x90/0x90
[ 22.183065] [<ffffffff81188ed9>] group_pin_kill+0x19/0x40
[ 22.183065] [<ffffffff811761d8>] namespace_unlock+0x58/0x60
[ 22.183065] [<ffffffff81178cae>] drop_collected_mounts+0x4e/0x60
[ 22.183065] [<ffffffff8117a3ed>] put_mnt_ns+0x2d/0x50
[ 22.183065] [<ffffffff81068b0a>] free_nsproxy+0x1a/0x80
[ 22.183065] [<ffffffff81068c68>] switch_task_namespaces+0x58/0x70
[ 22.183065] [<ffffffff81068c8b>] exit_task_namespaces+0xb/0x10
[ 22.183065] [<ffffffff8104eb57>] do_exit+0x2c7/0xc00
[ 22.183065] [<ffffffff8104f50a>] do_group_exit+0x3a/0xa0
[ 22.183065] [<ffffffff8104f57f>] SyS_exit_group+0xf/0x10
[ 22.183065] [<ffffffff817ad092>] system_call_fastpath+0x12/0x17

Kirill

On 19.04.2018 01:08, Kirill Tkhai wrote:
> Hi, Alexander!
>
> On 18.04.2018 22:45, Alexander Aring wrote:
>> I can currently crash my net/master kernel by executing the following script:
>>
>> --- snip
>>
>> modprobe dummy
>>
>> #mkdir /var/run/netns
>> #touch /var/run/netns/init_net
>> #mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>> while true
>> do
>> mkdir /var/run/netns
>> touch /var/run/netns/init_net
>> mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>> ip netns add foo
>> ip netns exec foo ip link add dummy0 type dummy
>> ip netns delete foo
>> done
>
> A fast answer is best, so I tried your test on my non-work computer.
> It runs an old kernel without asynchronous pernet operations:
>
> $uname -a
> Linux localhost.localdomain 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux
>
> After approximately 15 seconds of running your test, it died :(
> (Fortunately, I ran it in "init 1" with all partitions mounted read-only, as usual.)
>
> There is no serial console, so I can't say whether the first stack is exactly
> the same as the one you see. But it crashed. So it seems the problem has
> existed for a long time.
>
> Have you tried to reproduce it on older kernels or to bisect to the problematic commit?
> Or maybe it doesn't reproduce on old kernels in your environment?
>
>> --- snap
>>
>> After ~1 minute at most, the kernel crashes.
>> If I do my hack of saving init_net outside the loop instead, it runs fine,
>> so the bind mount inside the loop is necessary to trigger the crash.
>>
>> The last message which I see is:
>>
>> BUG: stack guard page was hit at 00000000f0751759 (stack is
>> 0000000069363195..0000000073ddc474)
>> kernel stack overflow (double-fault): 0000 [#1] SMP PTI
>> Modules linked in:
>> CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> RIP: 0010:validate_chain.isra.23+0x44/0xc40
>> RSP: 0018:ffffc900002cbff8 EFLAGS: 00010002
>> RAX: 0000000000040000 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
>> RDX: 0000000000000000 RSI: ffff8802b25ee2a0 RDI: ffff8802b25edb00
>> RBP: 0e58b88e1d4d15da R08: 0000000000000000 R09: 0000000000000004
>> R10: ffffc900002cc050 R11: ffff8802b1054be8 R12: 0000000000000001
>> R13: ffff8802b25ee268 R14: ffff8802b25edb00 R15: 0000000000000000
>> FS: 0000000000000000(0000) GS:ffff8802bfc00000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffc900002cbfe8 CR3: 0000000002024000 CR4: 00000000000006f0
>> Call Trace:
>> ? get_max_files+0x10/0x10
>> __lock_acquire+0x332/0x710
>> lock_acquire+0x67/0xb0
>> ? lockref_put_or_lock+0x9/0x30
>> ? dput.part.7+0x17/0x2d0
>> _raw_spin_lock+0x2b/0x60
>> ? lockref_put_or_lock+0x9/0x30
>> lockref_put_or_lock+0x9/0x30
>> dput.part.7+0x1ec/0x2d0
>> drop_mountpoint+0x10/0x40
>> pin_kill+0x9b/0x3a0
>> ? wait_woken+0x90/0x90
>> ? mnt_pin_kill+0x2d/0x100
>> mnt_pin_kill+0x2d/0x100
>> cleanup_mnt+0x66/0x70
>> pin_kill+0x9b/0x3a0
>> ? wait_woken+0x90/0x90
>> ? mnt_pin_kill+0x2d/0x100
>> mnt_pin_kill+0x2d/0x100
>> cleanup_mnt+0x66/0x70
>> ...
>>
>> I guess it may have something to do with the recent migration of
>> per-net ops to async.
>>
>> - Alex
>
> Kirill
>