Re: [PATCH 1/3] nbd: fix connection timed out error after reconnecting to server

From: Josef Bacik
Date: Tue May 28 2019 - 13:01:25 EST


On Tue, May 28, 2019 at 02:07:43AM +0800, Yao Liu wrote:
> On Fri, May 24, 2019 at 09:07:42AM -0400, Josef Bacik wrote:
> > On Fri, May 24, 2019 at 05:43:54PM +0800, Yao Liu wrote:
> > > Some I/O requests that have been sent succussfully but have not yet been
> > > replied won't be resubmitted after reconnecting because of server restart,
> > > so we add a list to track them.
> > >
> > > Signed-off-by: Yao Liu <yotta.liu@xxxxxxxxx>
> >
> > Nack, this is what the timeout stuff is supposed to handle. The commands will
> > timeout and we'll resubmit them if we have alive sockets. Thanks,
> >
> > Josef
> >
>
> On the one hand, if num_connections == 1 and the only sock has dead,
> then we do nbd_genl_reconfigure to reconnect within dead_conn_timeout,
> nbd_xmit_timeout will not resubmit commands that have been sent
> succussfully but have not yet been replied. The log is as follows:
>
> [270551.108746] block nbd0: Receive control failed (result -104)
> [270551.108747] block nbd0: Send control failed (result -32)
> [270551.108750] block nbd0: Request send failed, requeueing
> [270551.116207] block nbd0: Attempted send on invalid socket
> [270556.119584] block nbd0: reconnected socket
> [270581.161751] block nbd0: Connection timed out
> [270581.165038] block nbd0: shutting down sockets
> [270581.165041] print_req_error: I/O error, dev nbd0, sector 5123224 flags 8801
> [270581.165149] print_req_error: I/O error, dev nbd0, sector 5123232 flags 8801
> [270581.165580] block nbd0: Connection timed out
> [270581.165587] print_req_error: I/O error, dev nbd0, sector 844680 flags 8801
> [270581.166184] print_req_error: I/O error, dev nbd0, sector 5123240 flags 8801
> [270581.166554] block nbd0: Connection timed out
> [270581.166576] print_req_error: I/O error, dev nbd0, sector 844688 flags 8801
> [270581.167124] print_req_error: I/O error, dev nbd0, sector 5123248 flags 8801
> [270581.167590] block nbd0: Connection timed out
> [270581.167597] print_req_error: I/O error, dev nbd0, sector 844696 flags 8801
> [270581.168021] print_req_error: I/O error, dev nbd0, sector 5123256 flags 8801
> [270581.168487] block nbd0: Connection timed out
> [270581.168493] print_req_error: I/O error, dev nbd0, sector 844704 flags 8801
> [270581.170183] print_req_error: I/O error, dev nbd0, sector 5123264 flags 8801
> [270581.170540] block nbd0: Connection timed out
> [270581.173333] block nbd0: Connection timed out
> [270581.173728] block nbd0: Connection timed out
> [270581.174135] block nbd0: Connection timed out
>
> On the other hand, if we wait nbd_xmit_timeout to handle resubmission,
> the I/O requests will have a big delay. For example, if timeout time is 30s,
> and from sock dead to nbd_genl_reconfigure returned OK we only spend
> 2s, the I/O requests will still be handled by nbd_xmit_timeout after 30s.

We have to wait for the full timeout anyway to know that the socket went down,
so it'll be re-submitted right away and then we'll wait on the new connection.

Now we could definitely have requests that were submitted well after the first
thing that failed, so their timeout would be longer than simply retrying them,
but we have no idea of knowing which ones timed out and which ones didn't. This
way lies pain, because we have to matchup tags with handles. This is why we
rely on the generic timeout infrastructure, so everything is handled correctly
without ending up with duplicate submissions/replies. Thanks,

Josef