Re: [PATCH v6 5/5] nvme-fabrics: handle transient auth failures

From: Sagi Grimberg
Date: Mon Apr 15 2024 - 16:51:28 EST




On 15/04/2024 15:42, Daniel Wagner wrote:
Retry authentication a couple of times to avoid error out on transient
errors. E.g. if a new key is deployed on the host and the target. At the
same time the connection drops. The host might use the wrong key to
reconnect.

Suggested-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
Signed-off-by: Daniel Wagner <dwagner@xxxxxxx>
---
drivers/nvme/host/auth.c | 4 ++--
drivers/nvme/host/fabrics.c | 10 +++++++++-
drivers/nvme/host/fc.c | 2 ++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
6 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/auth.c b/drivers/nvme/host/auth.c
index a264b3ae078b..368dcc6ee11b 100644
--- a/drivers/nvme/host/auth.c
+++ b/drivers/nvme/host/auth.c
@@ -797,7 +797,7 @@ static void nvme_queue_auth_work(struct work_struct *work)
NVME_AUTH_DHCHAP_MESSAGE_SUCCESS1);
if (ret) {
chap->status = ret;
- chap->error = -ECONNREFUSED;
+ chap->error = -EKEYREJECTED;
return;
}
@@ -818,7 +818,7 @@ static void nvme_queue_auth_work(struct work_struct *work)
ret = nvme_auth_process_dhchap_success1(ctrl, chap);
if (ret) {
/* Controller authentication failed */
- chap->error = -ECONNREFUSED;
+ chap->error = -EKEYREJECTED;
goto fail2;
}
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index d9a73b1b41c4..6dfa45dce232 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -573,8 +573,16 @@ EXPORT_SYMBOL_GPL(nvmf_connect_io_queue);
*/
bool nvmf_should_reconnect(struct nvme_ctrl *ctrl, int status)
{
- if (status < 0)
+ if (status < 0) {
+ /*
+ * authentication errors can be transient, thus retry a couple
+ * of times before giving up.
+ */
+ if (status == -EKEYREJECTED &&
+ ++ctrl->nr_auth_retries < 3)
+ return true;

I did not suggest nr_auth_retries. Where is the 3 coming from? The controller already
has a number of reconnects before it gives up, no reason to add another one.

Just don't return false based on the status if it is a transient authentication error.

The patch just needs to be modified from:
    if (status < 0)
to
    if (status < 0 && status != -EKEYREJECTED Plus a comment that explains it.