[PATCH] nfsd: hold a lighter-weight client reference over CB_RECALL_ANY

From: Jeff Layton
Date: Fri Apr 05 2024 - 13:01:05 EST


Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
client. While a callback job is technically an RPC, this has the effect
of preventing the client from being unhashed.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
can end up in a situation where the callback can't complete on the (now
dead) callback channel, but the new client also can't connect because
the old client can't be unhashed.

The job is only holding a reference to the client so it can clear a flag
in the client after it completes. This patch attempts to alleviate this
by having the job instead hold a reference to the cl_nfsdfs.cl_ref.

Typically we only take that sort of reference when dealing with the
nfsdfs info files, but it should work appropriately here too.

Reported-by: Vladimir Benes <vbenes@xxxxxxxxxx>
Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
This patch seems to break the livelock in my testing. It's not the
prettiest fix, but it's narrowly targeted and should be appropriate for
6.9-rc. Longer term, I think we need to rework how the nfs4_clients
refcounts are managed, but that's a larger project.

Many thanks to Vladimir for all his help with tracking this down!
---
fs/nfsd/nfs4state.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5fcd93f7cb8c..4311d608a297 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3042,12 +3042,9 @@ static void
nfsd4_cb_recall_any_release(struct nfsd4_callback *cb)
{
struct nfs4_client *clp = cb->cb_clp;
- struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);

- spin_lock(&nn->client_lock);
clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
- put_client_renew_locked(clp);
- spin_unlock(&nn->client_lock);
+ drop_client(clp);
}

static int
@@ -6613,10 +6610,12 @@ deleg_reaper(struct nfsd_net *nn)
clp->cl_ra_time < 5)) {
continue;
}
- list_add(&clp->cl_ra_cblist, &cblist);

/* release in nfsd4_cb_recall_any_release */
- atomic_inc(&clp->cl_rpc_users);
+ if (!kref_get_unless_zero(&clp->cl_nfsdfs.cl_ref))
+ continue;
+
+ list_add(&clp->cl_ra_cblist, &cblist);
set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
clp->cl_ra_time = ktime_get_boottime_seconds();
}

---
base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861
change-id: 20240405-rhel-31513-028ab6f14252

Best regards,
--
Jeff Layton <jlayton@xxxxxxxxxx>