Re: [PATCH 03/10] nfsd: serialize nfsd4_end_grace() with atomic test-and-set

From: Jeff Layton

Date: Fri May 29 2026 - 12:34:56 EST

On Fri, 2026-05-29 at 11:38 -0400, Chuck Lever wrote:
>
> On Thu, May 28, 2026, at 5:55 PM, Jeff Layton wrote:
> > From: Chris Mason <clm@xxxxxxxx>
> >
> > nfsd4_end_grace() guards its drain path with a plain bool:
> >
> > if (nn->grace_ended)
> > return;
> > nn->grace_ended = true;
> >
> > The read and the write are independent, and nothing in struct
> > nfsd_net serializes them. At least two contexts can reach this
> > code with no lock held:
> >
> > laundromat path
> > laundry_wq kworker
> > nfs4_laundromat()
> > nfsd4_end_grace()
> >
> > RECLAIM_COMPLETE path
> > nfsd compound kthread
> > nfsd4_reclaim_complete()
> > inc_reclaim_complete()
> > nfsd4_end_grace()
> >
> > Both callers can observe grace_ended == false on different CPUs,
> > both store true, and both proceed into nfsd4_record_grace_done(),
> > which invokes the active client_tracking_ops->grace_done callback.
> > For tracking ops that drain reclaim_str_hashtbl (legacy_tracking_ops
> > via nfsd4_recdir_purge_old, and the cld v1+ ops via
> > nfsd4_cld_grace_done), grace_done calls nfs4_release_reclaim(),
> > which walks every bucket of reclaim_str_hashtbl with no lock and
> > calls nfs4_remove_reclaim_record() (list_del + kfree) on each
> > entry. Two concurrent walkers corrupt the list and double-free
> > every nfs4_client_reclaim. A concurrent nfsd4_find_reclaim_client()
> > iterating the same bucket reads through freed memory.
> >
> > A third call site exists in nfs4_state_start_net() on the
> > skip_grace startup path, but it runs under nfsd_mutex before any
> > client has connected and before the laundromat's first delayed
> > work fires, so it cannot race with the two callers above.
> >
> > Fix by replacing the read/write pair with try_cmpxchg() so exactly
> > one caller transitions grace_ended from false to true and proceeds
> > into the drain; the loser returns immediately. bool supports
> > 1-byte cmpxchg on all supported architectures, and no lock
> > ordering changes are needed.
> >
> > Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations
> > when using nfsdcld")
> > Assisted-by: kres:claude-opus-4-7
> > Signed-off-by: Chris Mason <clm@xxxxxxxx>
> > ---
> > fs/nfsd/nfs4state.c | 17 ++++++++++++++---
> > 1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index f4d12dbcf97b..dc4ac541436f 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -7022,12 +7022,23 @@ nfsd4_renew(struct svc_rqst *rqstp, struct
> > nfsd4_compound_state *cstate,
> > static void
> > nfsd4_end_grace(struct nfsd_net *nn)
> > {
> > - /* do nothing if grace period already ended */
> > - if (nn->grace_ended)
> > + bool expected = false;
> > +
> > + /*
> > + * nfsd4_end_grace() can be entered concurrently from the
> > + * laundromat workqueue and from an nfsd compound thread
> > + * handling RECLAIM_COMPLETE. Without serialization, both
> > + * callers can observe grace_ended==false and proceed into
> > + * nfsd4_record_grace_done(). For tracking ops whose
> > + * grace_done drains reclaim_str_hashtbl, that results in
> > + * list corruption and a double free of every
> > + * nfs4_client_reclaim entry. Use an atomic test-and-set so
> > + * exactly one caller proceeds.
> > + */
> > + if (!try_cmpxchg(&nn->grace_ended, &expected, true))
> > return;
> >
> > trace_nfsd_grace_complete(nn);
> > - nn->grace_ended = true;
> > /*
> > * If the server goes down again right now, an NFSv4
> > * client will still be allowed to reclaim after it comes back up,
> >
> > --
> > 2.54.0
>
> Seems like the usual idiom for something like this is an atomic
> bit op. Perhaps try_cmpxchg on a boolean variable is not going
> to behave as you expect on every hardware platform.

We just need a single flag here though. try_cmpxchg() had better work
the same way on every platform or a lot of stuff is FUBAR. Where
wouldn't it?
--
Jeff Layton <jlayton@xxxxxxxxxx>