RE: [PATCH] hv_sock: Fix data loss upon socket close
From: Sunil Muthuswamy
Date: Tue May 14 2019 - 16:42:47 EST
> -----Original Message-----
> From: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Sent: Friday, May 10, 2019 8:57 PM
> To: Sunil Muthuswamy <sunilmut@xxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>;
> Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>; Sasha Levin <sashal@xxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>;
> Michael Kelley <mikelley@xxxxxxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close
>
> > From: Sunil Muthuswamy <sunilmut@xxxxxxxxxxxxx>
> > Sent: Wednesday, May 8, 2019 4:11 PM
> >
> > Currently, when a hvsock socket is closed, the socket is shutdown and
> > immediately a RST is sent. There is no wait for the FIN packet to arrive
> > from the other end. This can lead to data loss since the connection is
> > terminated abruptly. This can manifest easily in cases of a fast guest
> > hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> > not following the proper STREAM(TCP) closing handshake mechanism.
>
> Hi Sunil,
> It looks to me the above description is inaccurate.
>
> In the upstream Linux kernel, closing a hv_sock file descriptor may hang
> in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
> the connection is closed. This is bad and should be fixed, but I don't think
> the current code can cause data loss: when Linux calls hvs_destruct() ->
> vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
> -> vmbus_close() to close the channel, Linux knows the host app has
> already called close(), and normally that means the host app has
> received all the data from the connection.
>
> BTW, technically speaking, in hv_sock there is no RST packet, while there
> is indeed a payload_len==0 packet, which is similar to TCP FIN.
>
> I think by saying "a RST is sent" you mean Linux VM closes the channel.
>
> > The fix involves adding support for the delayed close of hvsock, which is
> > in-line with other socket providers such as virtio.
>
> With this "delayed close" patch, Linux's close() won't hang until the host
> also closes the connection. This is good!
>
The next version of the patch will only focus on implementing the delayed
(or background) close logic. I will update the title and the description of the
next version patch to more accurately reflect the change.
> > While closing, the
> > socket waits for a constant timeout, for the FIN packet to arrive from the
> > other end. On timeout, it will terminate the connection (i.e a RST).
>
> As I mentioned above, I suppose the "RST" means Linux closes the channel.
>
> When Linux closes a connection, the FIN packet is written into the shared
> guest-to-host channel ringbuffer immediately, so the host is able to see it
> immediately, but the real question is: what if the host kernel and/or host app
> can not (timely) receive the data from the ringbuffer, inclding the FIN?
>
> Does the host kernel guarantee it *always* timely fetches/caches all the
> data from a connection, even if the host app has not accept()'d the
> conection, or the host app is reading from the connection too slowly?
>
> If the host doesn't guarantee that, then even with this patch there is still
> a chance Linux can time out, and close the channel before the host
> finishes receiving all the data.
>
TCP protocol does not guarantee that all the data gets delivered, especially
during close. The applications are required to mitigate this by using a
combination of shutdown() and subsequent read() on both client and server
side.
> I'm curious how Windows guest implements the "async close"?
> Does Windows guest also use the same timeout strategy here? If yes,
> what's the timeout value used?
>
Windows also implements the delayed close logic in a similar fashion. You can
lookup the MSDN article on 'graceful shutdown'.
> > diff --git a/net/vmw_vsock/hyperv_transport.c
> > b/net/vmw_vsock/hyperv_transport.c
> > index a827547..62b986d 100644
>
> Sorry, I need more time to review the rest of patch. Will try to reply ASAP.
>
> > -static int hvs_update_recv_data(struct hvsock *hvs)
> > +static int hvs_update_recv_data(struct vsock_sock *vsk)
> > {
> > struct hvs_recv_buf *recv_buf;
> > u32 payload_len;
> > + struct hvsock *hvs = vsk->trans;
> >
> > recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> > payload_len = recv_buf->hdr.data_size;
> > @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
> > if (payload_len > HVS_MTU_SIZE)
> > return -EIO;
> >
> > - if (payload_len == 0)
> > + /* Peer shutdown */
> > + if (payload_len == 0) {
> > + struct sock *sk = sk_vsock(vsk);
> > hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
> > + sk->sk_state_change(sk);
> > + }
>
> Can you please explain why we need to call this sk->sk_state_change()?
>
> When we call hvs_update_recv_data(), we hold the lock_sock(sk) lock, and we
> know there is at least one byte to read. Since we hold the lock, the other
> code paths, which normally are also requried to acquire the lock before
> checking vsk->peer_shutdown, can not race with us.
>
This was to wakeup any waiters on socket state change. Since the updated
patch now only focuses on adding the delayed close logic, I will remove this in
the next version.
Thanks for the review so far.
> Thanks,
> -- Dexuan