Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

From: Mohammed Gamal
Date: Wed Mar 13 2019 - 06:25:16 EST


On Tue, 2019-03-12 at 18:02 +0000, Haiyang Zhang wrote:
> Â
> Â
> > -----Original Message-----
> > From: Mohammed Gamal <mgamal@xxxxxxxxxx>
> > Sent: Thursday, March 7, 2019 1:32 PM
> > To: Michael Kelley <mikelley@xxxxxxxxxxxxx>; linux-hyperv@xxxxxxxxx
> el.org;
> > kimbrownkd <kimbrownkd@xxxxxxxxx>
> > Cc: Sasha Levin <Alexander.Levin@xxxxxxxxxxxxx>; Dexuan Cui
> > <decui@xxxxxxxxxxxxx>; Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>;
> > Long Li <longli@xxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> Haiyang
> > Zhang <haiyangz@xxxxxxxxxxxxx>; vkuznets <vkuznets@xxxxxxxxxx>;
> linux-
> > kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > hv_get_bytes_to_read/write
> >
> > On Thu, 2019-03-07 at 17:33 +0000, Michael Kelley wrote:
> > > From: Mohammed Gamal <mgamal@xxxxxxxxxx> Sent: Thursday, March 7,
> > > 2019 8:36 AM
> > > >
> > > > This patch adds a check for the presence of the ring buffer in
> > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > dereferences.
> > > > If the ring buffer is not yet allocated, return 0 bytes to be
> > > > read/written.
> > > >
> > > > The root cause is that code that accesses the ring buffer
> including
> > > > hv_get_bytes_to_read/write() could be vulnerable to the race
> > > > condition discussed in
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F
> %2Flk
> > > >
> > ml.org%2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Chaiyangz
> > %40m
> > > >
> > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141af9
> > 1
> > > >
> > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=b51Xc5GUN
> > nHX0K
> > > > 08LrH3ShTyFcRZ4mYHUATd%2BDpvYDw%3D&amp;reserved=0>;
> > > >
> > > > This race is being addressed by the patch series by Kimberly
> Brown
> > > > in
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F
> %2Flk
> > > >
> > ml.org%2Flkml%2F2019%2F2%2F21%2F1236&amp;data=02%7C01%7Chaiyangz
> > %40m
> > > >
> > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141af9
> > 1
> > > >
> > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=js1ff15Gbk7
> > 0MD
> > > > A2hkMZExxvAAbDuKDhfBvc5ZrckzM%3D&amp;reserved=0 which is not
> > final
> > > > yet
> > > >
> > > > Signed-off-by: Mohammed Gamal <mgamal@xxxxxxxxxx>
> > >
> > > Could you elaborate on the code paths where
> > > hv_get_bytes_to_read/write() could be called when the ring buffer
> > > isn't yet allocated?ÂÂMy sense is that Kim Brown's patch will
> address
> > > all of the code paths that involved sysfs access from outside the
> > > driver.ÂÂAnd within a driver, the ring buffer should never be
> accessed
> > > unless it is already allocated.ÂÂIs there another code path we're
> not
> > > aware of?ÂÂI'm wondering if these changes are really needed once
> Kim
> > > Brown's patch is finished.
> > >
> > > Michael
> >
> > I've seen one instance of the race in the netvsc driver when
> running traffic
> > through it with iperf3 while continuously changing the channel
> settings.
> >
> > The following code path deallocates the ring buffer:
> > netvsc_set_channels() -> netvsc_detach() ->
> > rndis_filter_device_remove() -> netvsc_device_remove() ->
> vmbus_close()
> > -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> >
> > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> concurrently
> > after vmbus_close() and before vmbus_open() returns and sets up the
> new ring
> > buffer.
> >
> > The race is fairly hard to reproduce on recent upstream kernels,
> but I still
> > managed to reproduce it.
> Â
> Looking at the code from netvsc_detach() â
> ÂÂÂÂÂÂÂÂÂnetif_tx_disable(ndev) is called before
> rndis_filter_device_remove(hdev, nvdev).
> So there should be no call to netvsc_send_pkt() after detaching.
> Whatâs the crash stack trace?
> Â
> static int netvsc_detach(struct net_device *ndev,
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct netvsc_device *nvdev)
> {
> ÂÂÂÂÂÂÂ struct net_device_context *ndev_ctx = netdev_priv(ndev);
> ÂÂÂÂÂÂÂ struct hv_device *hdev = ndev_ctx->device_ctx;
> ÂÂÂÂÂÂÂ int ret;
> Â
> ÂÂÂÂÂÂÂ /* Don't try continuing to try and setup sub channels */
> ÂÂÂÂÂÂÂ if (cancel_work_sync(&nvdev->subchan_work))
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ nvdev->num_chn = 1;
> Â
> ÂÂÂÂÂÂÂ /* If device was up (receiving) then shutdown */
> ÂÂÂÂÂÂÂ if (netif_running(ndev)) {
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ netif_tx_disable(ndev);
> Â
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ret = rndis_filter_close(nvdev);
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (ret) {
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ netdev_err(ndev,
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ "unable to close device (ret
> %d).\n", ret);
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return ret;
> ÂÂÂÂ ÂÂÂÂÂÂÂÂÂÂÂ}
> Â
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ret = netvsc_wait_until_empty(nvdev);
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (ret) {
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ netdev_err(ndev,
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ "Ring buffer not empty after
> closing rndis\n");
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return ret;
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ }
> ÂÂÂÂÂÂÂ }
> Â
> ÂÂÂÂÂÂÂ netif_device_detach(ndev);
> Â
> ÂÂÂÂÂÂÂ rndis_filter_device_remove(hdev, nvdev);
> Â
> ÂÂÂÂÂÂÂ return 0;
> }
> Â
> Thanks,
> Haiyang

Here is one stack trace on a 4.18 kernel, the most recent kernel I
managed to reproduce this bug on.Â
I haven't managed to reproduce on 5.0.0 yet, but I guess some recent
changes to the netvsc driver could be masking the problem, as I tried
backporting those changes to older RHEL 7 kernels and still managed to
reproduce the problem there. I could however be wrong, and any pointers
are still appreciated:

[ÂÂ545.308507] BUG: unable to handle kernel NULL pointer dereference at
0000000000000004
[ÂÂ545.308656] PGD 0 P4D 0Â
[ÂÂ545.308763] Oops: 0000 [#1] SMP PTI
[ÂÂ545.308855] CPU: 2 PID: 1800 Comm: iperf3 Kdump: loaded Not tainted
4.18.0-64.el8.test.x86_64 #1
[ÂÂ545.308990] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[ÂÂ545.309143] RIP: 0010:netvsc_send+0x2c9/0xce0 [hv_netvsc]
[ÂÂ545.309298] Code: 4c 8b b1 c0 00 00 00 4f 8d 2c 64 49 c1 e5 07 4d 03
ae c0 03 00 00 48 8b 84 03 30 01 00 00 4c 89 6c 24 18 48 8b 90 20 01 00
00 <8b> 72 04 8b 0a 8b 90 38 01 00 00 89 f7 01 f2 29 cf 29 ca 39 ce
Â0f
[ÂÂ545.309321] RSP: 0018:ffffb8a305d5b6c0 EFLAGS: 00010282
[ÂÂ545.309321] RAX: ffff926928bd7000 RBX: ffff92687dbe0000 RCX:
ffff92687d5bec00
[ÂÂ545.309321] RDX: 0000000000000000 RSI: ffff92691b61c654 RDI:
0000000000000000
[ÂÂ545.309321] RBP: ffff926915dcde28 R08: ffff926915dcde00 R09:
0000000000000000
[ÂÂ545.309321] R10: 00000000000db61c R11: 0000000000000f7e R12:
0000000000000001
[ÂÂ545.309321] R13: ffff926931808180 R14: ffff926931801000 R15:
0000000000000000
[ÂÂ545.309321] FS:ÂÂ00007feca6a4b740(0000) GS:ffff926940080000(0000)
knlGS:0000000000000000
[ÂÂ545.309321] CS:ÂÂ0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ÂÂ545.309321] CR2: 0000000000000004 CR3: 00000000dfccc004 CR4:
00000000003606e0
[ÂÂ545.309321] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ÂÂ545.309321] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ÂÂ545.309321] Call Trace:
[ÂÂ545.309321]ÂÂnetvsc_start_xmit+0x3c9/0x800 [hv_netvsc]
[ÂÂ545.309321]ÂÂ? __switch_to_asm+0x34/0x70
[ÂÂ545.309321]ÂÂ? __switch_to_asm+0x34/0x70
[ÂÂ545.309321]ÂÂ? ___slab_alloc+0x269/0x4e0
[ÂÂ545.309321]ÂÂ? __alloc_skb+0x82/0x1c0
[ÂÂ545.309321]ÂÂ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[ÂÂ545.309321]ÂÂ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[ÂÂ545.309321]ÂÂ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[ÂÂ545.309321]ÂÂ? _cond_resched+0x15/0x30
[ÂÂ545.309321]ÂÂ? netif_skb_features+0x118/0x280
[ÂÂ545.309321]ÂÂdev_hard_start_xmit+0xa5/0x210
[ÂÂ545.309321]ÂÂsch_direct_xmit+0x14f/0x340
[ÂÂ545.309321]ÂÂ__dev_queue_xmit+0x799/0x8f0
[ÂÂ545.309321]ÂÂip_finish_output2+0x2e0/0x430
[ÂÂ545.309321]ÂÂ? ip_finish_output+0x139/0x270
[ÂÂ545.309321]ÂÂip_output+0x6c/0xe0
[ÂÂ545.309321]ÂÂ? ip_append_data.part.50+0xc0/0xc0
[ÂÂ545.309321]ÂÂip_send_skb+0x15/0x40
[ÂÂ545.309321]ÂÂudp_send_skb.isra.43+0x153/0x340
[ÂÂ545.309321]ÂÂudp_sendmsg+0xac2/0xd30
[ÂÂ545.309321]ÂÂ? set_fd_set.part.7+0x40/0x40
[ÂÂ545.309321]ÂÂ? set_fd_set.part.7+0x40/0x40
[ÂÂ545.309321]ÂÂ? __check_object_size+0xa3/0x181
[ÂÂ545.309321]ÂÂ? sock_has_perm+0x78/0xa0
[ÂÂ545.309321]ÂÂ? core_sys_select+0x242/0x2f0
[ÂÂ545.309321]ÂÂ? sock_sendmsg+0x36/0x40
[ÂÂ545.309321]ÂÂ? udp_push_pending_frames+0x60/0x60
[ÂÂ545.309321]ÂÂsock_sendmsg+0x36/0x40
[ÂÂ545.309321]ÂÂsock_write_iter+0x8f/0xf0
[ÂÂ545.309321]ÂÂ__vfs_write+0x156/0x1a0
[ÂÂ545.309321]ÂÂvfs_write+0xa5/0x1a0
[ÂÂ545.309321]ÂÂksys_write+0x4f/0xb0
[ÂÂ545.309321]ÂÂdo_syscall_64+0x5b/0x1b0
[ÂÂ545.309321]ÂÂentry_SYSCALL_64_after_hwframe+0x65/0xca
[ÂÂ545.309321] RIP: 0033:0x7feca5fb5348
[ÂÂ545.309321] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00
00 f3 0f 1e fa 48 8d 05 d5 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ÂÂ545.309321] RSP: 002b:00007ffc3ff1f108 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ÂÂ545.309321] RAX: ffffffffffffffda RBX: 00000000000005a8 RCX:
00007feca5fb5348
[ÂÂ545.309321] RDX: 00000000000005a8 RSI: 00007feca6a59000 RDI:
0000000000000009
[ÂÂ545.309321] RBP: 00007feca6a59000 R08: 0000000000000002 R09:
00cd09a3238b4e43
[ÂÂ545.309321] R10: 0002961ecea49016 R11: 0000000000000246 R12:
0000000000000009
[ÂÂ545.309321] R13: 00000000000005a8 R14: 00007ffc3ff1f180 R15:
0000563c1e05b260
[ÂÂ545.309321] Modules linked in: nft_chain_nat_ipv6 nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nft_chain_route_ipv4 nf_conntrack ip_set nf_tables nfnetlink vfat fat
sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
intel_rapl_perf sg hv_utils hv_balloon pcspkr joydev xfs libcrc32c
sd_mod sr_mod cdrom serio_raw hv_storvsc hv_netvsc scsi_transport_fc
hyperv_fb hyperv_keyboard hid_hyperv crc32c_intel hv_vmbus dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: nft_compat]
[ÂÂ545.309321] CR2: 0000000000000004

>From the stack trace netvsc_send+0x2c9 points to this line:

static inline u32 hv_get_bytes_to_write(const struct hv_ring_bu
ffer_info *rbi)
{
ÂÂÂÂÂÂÂÂu32 read_loc, write_loc, dsize, write;

ÂÂÂÂÂÂÂÂdsize = rbi->ring_datasize;
ÂÂÂÂÂÂÂÂread_loc = READ_ONCE(rbi->ring_buffer->read_index); <---------
ÂÂÂÂÂÂÂÂwrite_loc = rbi->ring_buffer->write_index;

ÂÂÂÂÂÂÂÂwrite = write_loc >= read_loc ? dsize - (write_loc - read_loc)
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂread_loc - write_loc;
ÂÂÂÂÂÂÂÂreturn write;
}

which gets called from netvsc_send_pkt().