Re: [PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

From: Vitaly Kuznetsov
Date: Tue May 31 2016 - 12:27:08 EST


Dexuan Cui <decui@xxxxxxxxxxxxx> writes:

> There is a rare race when we remove an entry from the global list
> hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
> percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
> process_chn_event() -> pcpu_relid2channel() is trying to query the list,
> we can get the kernel fault.
>
> Similarly, we also have the issue in the code path: vmbus_process_offer() ->
> percpu_channel_enq().
>
> We can resolve the issue by disabling the tasklet when updating the list.
>
> The patch also moves vmbus_release_relid() to a later place where
> the channel has been removed from the per-cpu and the global lists.
>
> Reported-by: Rolf Neugebauer <rolf.neugebauer@xxxxxxxxxx>
> Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>

Tested 4.7-rc1 with this patch applied and the kernel always crashes on boot
(WS2016TP5, 12 CPU SMP guest, Generation 2):

[ 5.464251] hv_vmbus: Hyper-V Host Build:14300-10.0-1-0.1006; Vmbus version:4.0
[ 5.471666] hv_vmbus: Unknown GUID: f8e65716-3cb3-4a06-9a60-1889c5cccab5
[ 5.472143] BUG: unable to handle kernel paging request at 000000079fff5288
[ 5.477107] IP: [<ffffffffa0004b91>] vmbus_onoffer+0x311/0x570 [hv_vmbus]
[ 5.477107] PGD 0
[ 5.477107] Oops: 0000 [#1] SMP
[ 5.477107] Modules linked in: hv_vmbus
[ 5.477107] CPU: 11 PID: 189 Comm: kworker/11:1 Not tainted 4.7.0-rc1_dc1_test+ #262
[ 5.477107] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[ 5.477107] Workqueue: hv_vmbus_con vmbus_onmessage_work [hv_vmbus]
[ 5.477107] task: ffff8801796e4480 ti: ffff8801796e8000 task.ti: ffff8801796e8000
[ 5.477107] RIP: 0010:[<ffffffffa0004b91>] [<ffffffffa0004b91>] vmbus_onoffer+0x311/0x570 [hv_vmbus]
[ 5.477107] RSP: 0018:ffff8801796ebc50 EFLAGS: 00010286
[ 5.477107] RAX: 00000000ffff8801 RBX: ffff880032641000 RCX: 0000000000000050
[ 5.477107] RDX: 0000000000040000 RSI: 0000000000000000 RDI: ffff880032641000
[ 5.477107] RBP: ffff8801796ebd10 R08: 0000000000000001 R09: 0000000000000001
[ 5.477107] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000010
[ 5.477107] R13: 4a063cb3f8e65716 R14: b5caccc58918609a R15: ffffffffa0008b60
[ 5.477107] FS: 0000000000000000(0000) GS:ffff88017c000000(0000) knlGS:0000000000000000
[ 5.477107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.477107] CR2: 000000079fff5288 CR3: 0000000032613000 CR4: 00000000001406e0
[ 5.477107] Stack:
[ 5.477107] ffff880032641780 ffff88003264102c 0010010000000046 ffffffffa000646e
[ 5.477107] ffff8801796e5090 ffff8801796e4480 00000000004f827d 0000000000000001
[ 5.477107] 0000000000000000 ffff8801796ebce8 ffffffff810eaebc 00000000796e5058
[ 5.477107] Call Trace:
[ 5.477107] [<ffffffff810eaebc>] ? __lock_acquire+0x3dc/0x730
[ 5.477107] [<ffffffffa0005263>] vmbus_onmessage+0x33/0xa0 [hv_vmbus]
[ 5.477107] [<ffffffffa0001371>] vmbus_onmessage_work+0x21/0x30 [hv_vmbus]
[ 5.653321] [<ffffffff810abd1f>] process_one_work+0x1ff/0x6d0
[ 5.653321] [<ffffffff810abca1>] ? process_one_work+0x181/0x6d0
[ 5.653321] [<ffffffff810ac23e>] worker_thread+0x4e/0x490
[ 5.653321] [<ffffffff810ac1f0>] ? process_one_work+0x6d0/0x6d0
[ 5.653321] [<ffffffff810ac1f0>] ? process_one_work+0x6d0/0x6d0
[ 5.653321] [<ffffffff810b31b1>] kthread+0x101/0x120
[ 5.653321] [<ffffffff81739cef>] ret_from_fork+0x1f/0x40
[ 5.653321] [<ffffffff810b30b0>] ? kthread_create_on_node+0x250/0x250
[ 5.653321] Code: 74 24 08 48 c7 c7 60 6c 00 a0 e8 0a 9e 1b e1 b8 10 00 00 00 66 89 44 24 16 44 89 e6 48 89 df e8 f6 f9 ff ff 41 8b 87 f4 02 00 00 <48> 8b 14 c5 80 12 03 a0 f0 ff 42 10 48 8b 42 08 a8 02 75 f8 0f
[ 5.653321] RIP [<ffffffffa0004b91>] vmbus_onoffer+0x311/0x570 [hv_vmbus]
[ 5.653321] RSP <ffff8801796ebc50>
[ 5.653321] CR2: 000000079fff5288
[ 5.653321] ---[ end trace 62df6070997f1f10 ]---
[ 5.653321] Kernel panic - not syncing: Fatal exception
[ 5.653321] Kernel Offset: disabled
[ 5.653321] ---[ end Kernel panic - not syncing: Fatal exception
[ 5.653480] ------------[ cut here ]------------

I can investigate it tomorrow if this doesn't reproduce for you.

<skip>

--
Vitaly