Re: [PATCH v2 2/2] nfs: update inode ctime after removexattr operation

From: Jeff Layton

Date: Tue Mar 31 2026 - 17:23:27 EST


On Tue, 2026-03-31 at 11:47 -0700, Thomas Haynes wrote:
> On Tue, Mar 31, 2026 at 07:42:56AM -0800, Jeff Layton wrote:
> > On Fri, 2026-03-27 at 11:22 -0700, Thomas Haynes wrote:
> > > On Fri, Mar 27, 2026 at 12:59:54PM -0800, Jeff Layton wrote:
> > > > On Fri, 2026-03-27 at 12:20 -0400, Olga Kornievskaia wrote:
> > > > > On Fri, Mar 27, 2026 at 11:50 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, 2026-03-27 at 11:11 -0400, Olga Kornievskaia wrote:
> > > > > > > On Tue, Mar 24, 2026 at 1:32 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > xfstest generic/728 fails with delegated timestamps. The client does a
> > > > > > > > removexattr and then a stat to test the ctime, which doesn't change. The
> > > > > > > > stat() doesn't trigger a GETATTR because of the delegated timestamps, so
> > > > > > > > it relies on the cached ctime, which is wrong.
> > > > > > > >
> > > > > > > > The setxattr compound has a trailing GETATTR, which ensures that its
> > > > > > > > ctime gets updated. Follow the same strategy with removexattr.
> > > > > > >
> > > > > > > This approach relies on the fact that the server the serves delegated
> > > > > > > attributes would update change_attr on operations which might now
> > > > > > > necessarily happen (ie, linux server does not update change_attribute
> > > > > > > on writes or clone). I propose an alternative fix for the failing
> > > > > > > generic/728.
> > > > > > >
> > > > > > > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> > > > > > > index 7b3ca68fb4bb..ede1835a45b3 100644
> > > > > > > --- a/fs/nfs/nfs42proc.c
> > > > > > > +++ b/fs/nfs/nfs42proc.c
> > > > > > > @@ -1389,7 +1389,13 @@ static int _nfs42_proc_removexattr(struct inode
> > > > > > > *inode, const char *name)
> > > > > > > &res.seq_res, 1);
> > > > > > > trace_nfs4_removexattr(inode, name, ret);
> > > > > > > if (!ret)
> > > > > > > - nfs4_update_changeattr(inode, &res.cinfo, timestamp, 0);
> > > > > > > + if (nfs_have_delegated_attributes(inode)) {
> > > > > > > + nfs_update_delegated_mtime(inode);
> > > > > > > + spin_lock(&inode->i_lock);
> > > > > > > + nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS);
> > > > > > > + spin_unlock(&inode->i_lock);
> > > > > > > + } else
> > > > > > > + nfs4_update_changeattr(inode, &res.cinfo, timestamp, 0);
> > > > > > >
> > > > > > > return ret;
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > > What's the advantage of doing it this way?
> > > > > >
> > > > > > You just sent a REMOVEXATTR operation to the server that will change
> > > > > > the mtime there. The server has the most up-to-date version of the
> > > > > > mtime and ctime at that point.
> > > > >
> > > > > In presence of delegated attributes, Is the server required to update
> > > > > its mtime/ctime on an operation? As I mentioned, the linux server does
> > > > > not update its ctime/mtime for WRITE, CLONE, COPY.
> > > > >
> > > > > Is possible that
> > > > > some implementations might be different and also do not update the
> > > > > ctime/mtime on REMOVEXATTR?
> > > > >
> > > > > Therefore I was suggesting that the patch
> > > > > relies on the fact that it would receive an updated value. Of course
> > > > > perhaps all implementations are done the same as the linux server and
> > > > > my point is moot. I didn't see anything in the spec that clarifies
> > > > > what the server supposed to do (and client rely on).
> > > > >
> > > >
> > > > (cc'ing Tom)
> > > >
> > > > That is a very good point.
> > > >
> > > > My interpretation was that delegated timestamps generally covered
> > > > writes, but SETATTR style operations that do anything beyond only
> > > > changing the mtime can't be cached.
> > > >
> > > > We probably need some delstid spec clarification: for what operations
> > > > is the server required to disable timestamp updates when a write
> > > > delegation is outstanding?
> > > >
> > > > In the case of nfsd, we disable timestamp updates for WRITE/COPY/CLONE
> > > > but not SETATTR/SETXATTR/REMOVEXATTR.
> > > >
> > > > How does the Hammerspace anvil behave? Does it disable c/mtime updates
> > > > for writes when there is an outstanding timestamp delegation like we're
> > > > doing in nfsd? If so, does it do the same for
> > > > SETATTR/SETXATTR/REMOVEXATTR operations as well?
> > >
> > > Jeff,
> > >
> > > I think the right way to look at this is closer to how size is
> > > handled under delegation in RFC8881, rather than as a per-op rule.
> > >
> > > In our implementation, because we are acting as an MDS and data I/O
> > > goes to DSes, we already treat size as effectively delegated when
> > > a write layout is outstanding. The MDS does not maintain authoritative
> > > size locally in that case. We may refresh size/timestamps internally
> > > (e.g., on GETATTR by querying DSes), but we don’t treat that as
> > > overriding the delegated authority.
> > >
> > > For timestamps, our behavior is effectively the same model. When
> > > the client holds the relevant delegation, the server does not
> > > consider itself authoritative for ctime/mtime. If current values
> > > are needed, we can obtain them from the client (e.g., via CB_GETATTR),
> > > and the client must present the delegation stateid to demonstrate
> > > that authority. So the authority follows the delegation, not the
> > > specific operation.
> > >
> > > That said, I don’t think we’ve fully resolved the semantics for all
> > > metadata-style ops either. WRITE and SETATTR are clear in our model,
> > > but for things like CLONE/COPY/SETXATTR/REMOVEXATTR, we’ve likely
> > > been relying on assumptions rather than a fully consistent rule.
> > > I.e., CLONE and COPY we just pass through to the DS and we don't
> > > implement SETXATTR/REMOVEXATTR.
> > >
> > > So the spec question, as I see it, is not whether REMOVEXATTR (or
> > > any particular op) should update ctime/mtime, but whether delegated
> > > timestamps are meant to follow the same attribute-authority model
> > > as delegated size in RFC8881. If so, then we expect that the server
> > > should query the client via CB_GETATTR to return updated ctime/mtime
> > > after such operations while the delegation is outstanding.
> > >
> >
> > The dilemma we have is: because we _do_ allow local processes to stat()
> > files that have an outstanding write delegation, we can never allow the
> > ctime in particular to roll backward (modulo clock jumps).
>
> I agree we do not want visible ctime rollback, but I do not see how
> that can be guaranteed from delegated timestamps alone when the
> authoritative timestamp may be generated on a different node with
> a different clock and the object may change during the CB_GETATTR
> window. That seems to require either monotonic clamping of exposed
> ctime, or treating change_attr rather than ctime as the real
> serialization signal.
>
> >
> > If we're dealing with changes that have been cached in the client and
> > are being lazily flushed out, then we can't update the timestamp when
> > that operation occurs. The time of the RPC to flush the changes will
> > almost certainly be later than the cached timestamps on the client that
> > will eventually be set, so when the client comes back we'd end up
> > violating the rollback rule.
> >
> > Our only option is to freeze timestamp updates on anything that might
> > represent such an operation. So far, we only do that on WRITE and COPY
> > operations -- in general, operations that require an open file, since
> > FMODE_NOCMTIME is attached to the file.
> >
> > Some SETATTRs that only update the mtime and atime can be cached on the
> > client by virtue of the fact that it's authoritative for timestamps.
> > There are some exceptions though:
> >
> > - atime-only updates can't be cached since the ctime won't change with
> > a timestamp update if the mtime didn't change
> >
> > - if you set the mtime to a time that is later than the time you got
> > the delegation from the server, but earlier than the current time, you
> > can't cache that. The ctime would be later than the mtime in that case,
> > and we don't have a mechanism to handle that in a delegated timestamp
> > SETATTR.
> >
> > I don't see how you could reasonably buffer a SETXATTR or REMOVEXATTR
> > operation to be sent later. These need to be done synchronously since
> > they could always fail for some reason and we don't have a mechanism at
> > the syscall layer to handle a deferred error. They also only update the
> > ctime and not the mtime, and we have no mechanism to do that with
> > delegated timestamps.
> >
> > Based on that, I think the client and server both need to ignore the
> > timestamp delegation on a SETXATTR or REMOVEXATTR. The server should
> > update the ctime and the client needs to send a trailing GETATTR on the
> > REMOVEXATTR compound in order to get it and the change attr.
>
> One concern I have with a per-op formulation is extensibility. If
> delegated timestamp behavior is defined by enumerating specific
> operations, then every new operation added to the protocol creates
> a fresh ambiguity until the spec is updated again. It seems better
> to define the behavior in terms of operation properties - e.g., whether
> the operation is synchronously visible, can be deferred/cached at
> the client, and whether it affects only ctime versus mtime/atime -
> so future operations can be classified without reopening the base
> rule.
>
> I.e., I can't tell if you want me to update the spec with
> guidance per-op or you are just documenting what you did.
>
>

I think we probably need some guidance in the spec, and I think that
guidance comes down to: operations that don't have a way to report a
delayed error condition can't be buffered on the client and must
continue to be done synchronously even if a delegation is held.

By way of example: if I do a write() on the client I can buffer that
because userland can eventually do an fsync() to see if it succeeded.
This is not true for syscalls like setxattr() or removexattr(), or most
syscalls that result in a SETATTR operation (chmod(), chown(), etc).

They must be done synchronously because:

1/ there's no way to update only the ctime in a delegated timestamp
update

...and...

2/ these syscalls can fail, so we can't return from them until we know
the outcome.

How best to phrase this guidance, I'm not sure...
--
Jeff Layton <jlayton@xxxxxxxxxx>