Re: [PATCH v2] RDMA/cma: Make CM response timeout and # CM retries configurable

From: Doug Ledford
Date: Thu Jun 13 2019 - 11:07:29 EST


On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
> During certain workloads, the default CM response timeout is too
> short, leading to excessive retries. Hence, make it configurable
> through sysctl. While at it, also make number of CM retries
> configurable.
>
> The defaults are not changed.
>
> Signed-off-by: Håkon Bugge <haakon.bugge@xxxxxxxxxx>
> ---
> v1 -> v2:
> * Added unregister_net_sysctl_table() in cma_cleanup()
> ---
> drivers/infiniband/core/cma.c | 52 ++++++++++++++++++++++++++++++-----
> 1 file changed, 45 insertions(+), 7 deletions(-)

This has been sitting in patchwork since forever, presumably because
neither Jason nor I felt like we really wanted it, but we also
couldn't justify flatly refusing it. Well, I've made up my mind, so
unless Jason wants to argue the other side, I'm rejecting this patch.
Here's why. The whole concept of a timeout is to help recovery in a
situation that overloads one end of the connection. There is a
relationship between the max queue backlog on the one host and the
timeout on the other host. Generally, in order for a request to get
dropped and us to need to retransmit, the queue must already have a
full backlog. So, how long does it take a heavily loaded system to
process a full backlog? That, plus a fuzz for a margin of error,
should be our timeout. We shouldn't be asking users to configure it.

However, if users change the default backlog queue depth on their
systems, *then* it would make sense to have them also change the
timeout here, but I think guidance would be helpful.

So, to revive this patch, what I'd like to see is some attempt to
quantify a reasonable timeout for the default backlog depth. The
patch should then change the default to that timeout, and add the
ability to adjust it, with some doc guidance on how to calculate a
reasonable timeout based on the configured backlog depth.

--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
