Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

From: Dave Chinner
Date: Wed Sep 21 2022 - 17:41:37 EST


On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote:
> On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> > > How do we determine what that offset should be? Your last email
> > > suggested that there really is no limit to the number of i_version bumps
> > > that can happen in memory before one of them makes it to disk. What can
> > > we do to address that?
> >
> > <shrug>
> >
> > I'm just pointing out problems I see when defining this as behaviour
> > for on-disk format purposes. If we define it as part of the on-disk
> > format, then we have to be concerned about how it may be used
> > outside the scope of just the NFS server application.
> >
> > However, if NFS keeps this metadata and functionality entirely
> > contained at the application level via xattrs, I really don't care
> > what algorithm NFS developers decide to use for their crash
> > sequencing. It's not my concern at this point, and that's precisely
> > why NFS should be using xattrs for this NFS specific functionality.
> >
>
> I get it: you'd rather not have to deal with what you see as an NFS
> problem, but I don't get how what you're proposing solves anything. We
> might be able to use that scheme to detect crashes, but that's only part
> of the problem (and it's a relatively simple part of the problem to
> solve, really).
>
> Maybe you can clarify it for me:
>
> Suppose we go with what you're saying and store some information in
> xattrs that allows us to detect crashes in some fashion. The server
> crashes and comes back up and we detect that there was a crash earlier.
>
> What does nfsd need to do now to ensure that it doesn't hand out a
> duplicate change attribute?

As I've already stated, the NFS server can hold the persistent NFS
crash counter value in a second xattr that it bumps whenever it
detects a crash. That takes the local filesystem completely out of
the equation. How the nfsd then folds the crash counter into the NFS
protocol change attribute is an nfsd problem, not a local filesystem
problem.

If you're worried about the maximum number of writes outstanding vs
i_version bumps that are held in memory, then *bound the maximum
number of uncommitted i_version changes that the NFS server will
allow to build up in memory*. By making the crash counter an
NFS-server-only function, the NFS server controls the entire
algorithm and it doesn't have to care about external 3rd party
considerations the way local filesystems have to.

e.g. The NFS server can track the i_version values when the NFSD
syncs/commits a given inode. The nfsd can sample i_version when it
calls ->commit_metadata or flushes data on the inode, and then when
it peeks at i_version while gathering post-op attrs (or any other
getattr op) it can decide that there is too much in-memory change
(e.g. 10,000 counts since last sync) and sync the inode.
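
A minimal userspace sketch of that tracking, not actual nfsd code. The
struct, the function names and the MAX_UNSYNCED_BUMPS constant are all
hypothetical; the cap simply reuses the 10,000 figure from the example
above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap, taken from the "10,000 counts" example. */
#define MAX_UNSYNCED_BUMPS 10000ULL

/* Hypothetical per-inode state kept by the server. */
struct change_tracker {
	uint64_t synced_version;	/* i_version sampled at the last sync */
};

/* Call once ->commit_metadata or a data flush has completed. */
static void tracker_note_sync(struct change_tracker *t, uint64_t i_version)
{
	t->synced_version = i_version;
}

/*
 * Call when peeking at i_version for post-op attrs or a getattr.
 * Returns true if too many bumps have built up since the last sync.
 */
static bool tracker_needs_sync(const struct change_tracker *t,
			       uint64_t i_version)
{
	return i_version - t->synced_version >= MAX_UNSYNCED_BUMPS;
}
```

When tracker_needs_sync() returns true, the caller would sync the
inode and then call tracker_note_sync() with the freshly sampled
value.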

i.e. the NFS server can trivially cap the maximum number of
uncommitted NFS change attr bumps it allows to build up in memory.
At that point, the NFS server has a bounded "maximum write count" that
can be used in conjunction with the xattr based crash counter to
determine how the change_attr is bumped by the crash counter.
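
One possible folding, purely as an illustration: advance the change
attribute by the cap for every crash recorded. This is an assumption,
not something nfsd does today; fold_change_attr() and
MAX_UNSYNCED_BUMPS are hypothetical names. It relies on fewer than
MAX_UNSYNCED_BUMPS bumps ever being uncommitted at the time of a
crash.

```c
#include <stdint.h>

/* Hypothetical cap on uncommitted i_version bumps held in memory. */
#define MAX_UNSYNCED_BUMPS 10000ULL

/*
 * Fold the xattr-held crash counter into the change attribute.
 * Each crash can lose fewer than MAX_UNSYNCED_BUMPS bumps, so
 * advancing by that amount per crash puts every post-crash value
 * above anything handed out before the crash.
 */
static uint64_t fold_change_attr(uint64_t i_version, uint64_t crash_counter)
{
	return i_version + crash_counter * MAX_UNSYNCED_BUMPS;
}
```

For example, if the on-disk i_version is 100 and up to 9,999 bumps
were lost, the largest value handed out before the crash was 10,099,
while the first value handed out after it is 10,100.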

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx