Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set

From: Chuck Lever

Date: Fri May 29 2026 - 11:59:28 EST

On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> From: Chris Mason <clm@xxxxxxxx>
>
> nfsd4_end_grace() guards its drain path with a plain bool:
>
> if (nn->grace_ended)
> return;
> nn->grace_ended = true;
>
> The read and the write are independent, and nothing in struct
> nfsd_net serializes them. At least two contexts can reach this
> code with no lock held:
>
> laundromat path
> laundry_wq kworker
> nfs4_laundromat()
> nfsd4_end_grace()
>
> RECLAIM_COMPLETE path
> nfsd compound kthread
> nfsd4_reclaim_complete()
> inc_reclaim_complete()
> nfsd4_end_grace()
>
> Both callers can observe grace_ended == false on different CPUs,
> both store true, and both proceed into nfsd4_record_grace_done(),
> which invokes the active client_tracking_ops->grace_done callback.
> For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
> via nfsd4_recdir_purge_old, and the cld v1+ ops via
> nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
> which walks every bucket of reclaim_str_hashtbl with no lock and
> calls nfs4_remove_reclaim_record() (list_del + kfree) on each
> entry. Two concurrent walkers corrupt the list and double-free
> every nfs4_client_reclaim. A concurrent nfsd4_find_reclaim_client()
> iterating the same bucket reads through freed memory.
>
> A third call site exists in nfs4_state_start_net() on the
> skip_grace startup path, but it runs under nfsd_mutex before any
> client has connected and before the laundromat's first delayed
> work fires, so it cannot race with the two callers above.
>
> Fix by replacing the read/write pair with try_cmpxchg() so exactly
> one caller transitions grace_ended from false to true and proceeds
> into the drain; the loser returns immediately. bool supports
> 1-byte cmpxchg on all supported architectures, and no lock
> ordering changes are needed.
>
> Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations
> when using nfsdcld")
> Assisted-by: kres:claude-opus-4-7
> Signed-off-by: Chris Mason <clm@xxxxxxxx>
> ---
> fs/nfsd/nfs4state.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f4d12dbcf97b..dc4ac541436f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct
> nfsd4_compound_state *cstate,
> static void
> nfsd4_end_grace(struct nfsd_net *nn)
> {
> - /* do nothing if grace period already ended */
> - if (nn->grace_ended)
> + bool expected = false;
> +
> + /*
> + * nfsd4_end_grace() can be entered concurrently from the
> + * laundromat workqueue and from an nfsd compound thread
> + * handling RECLAIM_COMPLETE. Without serialization, both
> + * callers can observe grace_ended==false and proceed into
> + * nfsd4_record_grace_done(). For tracking ops whose
> + * grace_done drains reclaim_str_hashtbl, that results in
> + * list corruption and a double free of every
> + * nfs4_client_reclaim entry. Use an atomic test-and-set so
> + * exactly one caller proceeds.
> + */
> + if (!try_cmpxchg(&nn->grace_ended, &expected, true))
> return;
>
> trace_nfsd_grace_complete(nn);
> - nn->grace_ended = true;
> /*
> * If the server goes down again right now, an NFSv4
> * client will still be allowed to reclaim after it comes back up,
>
> --
> 2.54.0

Seems like the usual idiom for something like this is an atomic
bit op. Perhaps try_cmpxchg on a boolean variable is not going
to behave as you expect on every hardware platform.

--
Chuck Lever