[PATCH v4 1/2] nfsd: overhaul CB_SEQUENCE error handling

From: Jeff Layton
Date: Fri Feb 07 2025 - 08:45:33 EST

Next message: Jeff Layton: "[PATCH v4 2/2] nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()"
Previous message: Jeff Layton: "[PATCH v4 0/2] nfsd: CB_SEQUENCE error handling fixes and cleanups"
In reply to: Jeff Layton: "[PATCH v4 0/2] nfsd: CB_SEQUENCE error handling fixes and cleanups"
Next in thread: Jeff Layton: "[PATCH v4 2/2] nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Some of the existing error handling in nfsd4_cb_sequence_done is
incorrect. For instance, if a NFS4ERR_SEQ_MISORDERED or NFS4ERR_BADSLOT
status races with rpc_clnt teardown, then the callback could be dropped
on the floor and never resent. This patch first does some general cleanup
and then reworks most of the error handling cases from first principles:

There is only one case where we want to proceed with processing the rest
of the CB_COMPOUND, and that's when the cb_seq_status is 0. Make the
default return value be false, and only set it to true in that case.

Rename the "need_restart" label to "requeue", to better indicate that
it's being requeued to the workqueue.

When the callback isn't going to be requeued or restarted, don't bother
checking RPC_SIGNALLED(). That should only be checked when we intend to
restart an rpc_task.

For -ESERVERFAULT, don't increment the sequence number since it's
possible the sessionid didn't match. Mark the backchannel bad and wait
for it to be recreated.

For NFS4ERR_DELAY, test for RPC_SIGNALLED() first, since that's an
indication that the client is being torn down. Requeue it in that case.

When restarting the RPC (but not the entire callback), test
RPC_SIGNALLED(). If it has been, then fall through to restart the
callback instead of just restarting the RPC.

For NFS4ERR_BADSLOT, mark the backchannel faulty and leak the slot so
that it can't be reused.

For NFS4ERR_SEQ_MISORDERED, retry once with seq_nr of 1, and if that
fails, handle it like BADSLOT.

When requeuing a callback, always release the slot in case there
have been changes to the slot table or session in the interim when it's
resubmitted.

Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
fs/nfsd/nfs4callback.c | 64 +++++++++++++++++++++++++++++++++-----------------
1 file changed, 43 insertions(+), 21 deletions(-)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index cf6d29828f4e561418b812ea2c9402929dd52bd0..a8603c3cdcdafbf9ccc6296425112f1730419b7b 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1332,7 +1332,7 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
{
struct nfs4_client *clp = cb->cb_clp;
struct nfsd4_session *session = clp->cl_cb_session;
- bool ret = true;
+ bool ret = false;

if (!clp->cl_minorversion) {
/*
@@ -1345,13 +1345,13 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
* handle that case here.
*/
if (RPC_SIGNALLED(task))
- goto need_restart;
+ goto requeue;

return true;
}

if (cb->cb_held_slot < 0)
- goto need_restart;
+ goto requeue;

/* This is the operation status code for CB_SEQUENCE */
trace_nfsd_cb_seq_status(task, cb);
@@ -1365,11 +1365,15 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
* (sequence ID, cached reply) MUST NOT change.
*/
++session->se_cb_seq_nr[cb->cb_held_slot];
+ ret = true;
break;
case -ESERVERFAULT:
- ++session->se_cb_seq_nr[cb->cb_held_slot];
+ /*
+ * Call succeeded, but CB_SEQUENCE reply failed sanity checks.
+ * The client has gone insane. Mark the BC faulty, since there
+ * isn't much else we can do.
+ */
nfsd4_mark_cb_fault(cb->cb_clp);
- ret = false;
break;
case 1:
/*
@@ -1380,39 +1384,57 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
*/
fallthrough;
case -NFS4ERR_BADSESSION:
+ /* Mark the backchannel faulty. Restart the call. */
nfsd4_mark_cb_fault(cb->cb_clp);
- ret = false;
- goto need_restart;
+ goto requeue;
case -NFS4ERR_DELAY:
+ /*
+ * If the rpc_clnt is being torn down, then we must restart
+ * the call from scratch.
+ */
+ if (RPC_SIGNALLED(task))
+ goto requeue;
cb->cb_seq_status = 1;
- if (!rpc_restart_call(task))
- goto out;
-
+ rpc_restart_call(task);
rpc_delay(task, 2 * HZ);
return false;
- case -NFS4ERR_BADSLOT:
- goto retry_nowait;
case -NFS4ERR_SEQ_MISORDERED:
+ /*
+ * Reattempt once with seq_nr 1. If that fails, treat this
+ * like BADSLOT.
+ */
if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
session->se_cb_seq_nr[cb->cb_held_slot] = 1;
goto retry_nowait;
}
- break;
+ fallthrough;
+ case -NFS4ERR_BADSLOT:
+ /*
+ * BADSLOT means that the client and server are out of sync
+ * on the BC parameters. In this case, we want to mark the
+ * backchannel faulty and then try it again, but _leak_ the
+ * slot so no one uses it.
+ */
+ nfsd4_mark_cb_fault(cb->cb_clp);
+ cb->cb_held_slot = -1;
+ goto retry_nowait;
default:
nfsd4_mark_cb_fault(cb->cb_clp);
}
trace_nfsd_cb_free_slot(task, cb);
nfsd41_cb_release_slot(cb);
-
- if (RPC_SIGNALLED(task))
- goto need_restart;
-out:
return ret;
retry_nowait:
- if (rpc_restart_call_prepare(task))
- ret = false;
- goto out;
-need_restart:
+ /*
+ * RPC_SIGNALLED() means that the rpc_client is being torn down and
+ * (possibly) recreated. Restart the call completely in that case.
+ */
+ if (!RPC_SIGNALLED(task)) {
+ rpc_restart_call_prepare(task);
+ return false;
+ }
+requeue:
+ nfsd41_cb_release_slot(cb);
if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
trace_nfsd_cb_restart(clp, cb);
task->tk_status = 0;

--
2.48.1

Next message: Jeff Layton: "[PATCH v4 2/2] nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()"
Previous message: Jeff Layton: "[PATCH v4 0/2] nfsd: CB_SEQUENCE error handling fixes and cleanups"
In reply to: Jeff Layton: "[PATCH v4 0/2] nfsd: CB_SEQUENCE error handling fixes and cleanups"
Next in thread: Jeff Layton: "[PATCH v4 2/2] nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]