Re: [PATCH v8 0/5] fs: multigrain timestamps for XFS's change_cookie
From: Dave Chinner
Date: Mon Sep 25 2023 - 18:32:58 EST

On Mon, Sep 25, 2023 at 06:14:05AM -0400, Jeff Layton wrote:
> On Mon, 2023-09-25 at 08:18 +1000, Dave Chinner wrote:
> > On Sat, Sep 23, 2023 at 05:52:36PM +0300, Amir Goldstein wrote:
> > > On Sat, Sep 23, 2023 at 1:46 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > On Sat, 2023-09-23 at 10:15 +0300, Amir Goldstein wrote:
> > > > > On Fri, Sep 22, 2023 at 8:15 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > My initial goal was to implement multigrain timestamps on most major
> > > > > > filesystems, so we could present them to userland, and use them for
> > > > > > NFSv3, etc.
> > > > > >
> > > > > > With the current implementation however, we can't guarantee that a file
> > > > > > with a coarse grained timestamp modified after one with a fine grained
> > > > > > timestamp will always appear to have a later value. This could confuse
> > > > > > some programs like make, rsync, find, etc. that depend on strict
> > > > > > ordering requirements for timestamps.
> > > > > >
> > > > > > The goal of this version is more modest: fix XFS' change attribute.
> > > > > > XFS's change attribute is bumped on atime updates in addition to other
> > > > > > deliberate changes. This makes it unsuitable for export via nfsd.
> > > > > >
> > > > > > Jan Kara suggested keeping this functionality internal-only for now and
> > > > > > plumbing the fine grained timestamps through getattr [1]. This set takes
> > > > > > a slightly different approach and has XFS use the fine-grained attr to
> > > > > > fake up STATX_CHANGE_COOKIE in its getattr routine itself.
> > > > > >
> > > > > > While we keep fine-grained timestamps in struct inode, when presenting
> > > > > > the timestamps via getattr, we truncate them at a granularity of number
> > > > > > of ns per jiffy,
> > > > >
> > > > > That's not good, because user explicitly set granular mtime would be
> > > > > truncated too and booting with different kernels (HZ) would change
> > > > > the observed timestamps of files.
> > > > >
> > > >
> > > > Thinking about this some more, I think the first problem is easily
> > > > addressable:
> > > >
> > > > The ctime isn't explicitly settable and with this set, we're already not
> > > > truncating the atime. We haven't used any of the extra bits in the mtime
> > > > yet, so we could just carve out a flag in there that says "this mtime
> > > > was explicitly set and shouldn't be truncated before presentation".
> > > >
> > >
> > > I thought about this option too.
> > > But note that the "mtime was explicitly set" flag needs
> > > to be persisted to disk so you cannot store it in the high nsec bits.
> > > At least XFS won't store those bits if you use them - they have to
> > > be translated to an XFS inode flag and I don't know if changing
> > > XFS on-disk format was on your wish list.
> >
> > Remember: this multi-grain timestamp thing was an idea to solve the
> > NFS change attribute problem without requiring *any* filesystem with
> > sub-jiffy timestamp capability to change their on-disk format to
> > implement a persistent change attribute that matches the new
> > requirements of the kernel nfsd.
> >
> > If we now need to change the on-disk format to support
> > some whacky new timestamp semantic to do this, then people have
> > completely lost sight of what problem the multi-grain timestamp idea
> > was supposed to address.
> >
>
> Yep. The main impetus for all of this was to fix XFS's change attribute
> without requiring an on-disk format change. If we have to rev the on-
> disk format, we're probably better off plumbing in a proper i_version
> counter and tossing this idea aside.
>
> That said, I think all we'd need for this scheme is a single flag per
> inode (to indicate that the mtime shouldn't be truncated before
> presentation). If that's possible to do without fully revving the inode
> format, then we could still pursue this. I take it that's probably not
> the case though.

Older kernels won't know what the flag means, but that should
be OK for an inode flag. The bigger issue is that none of the
userspace tools (xfs_db, xfs_repair, etc) know about it, so they
would have to be taught about it. And then there's testing it, which
likely means userspace needs visibility of the flag (e.g. FS_XFLAG
for it) and then there's more work....

It's really not worth it.

I think that Linus's suggestion of the in-memory inode timestamp
always being a 64bit, 100ns granularity value instead of a timespec
that gets truncated at sample time has merit as a general solution.
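
To make the contrast concrete, here is a minimal userspace sketch in
Python (not kernel code, and not taken from the patch series). It
assumes the 64-bit value is a signed count of 100ns units since the
epoch, which the thread does not specify; all function names are
hypothetical. It compares that fixed granularity with truncating the
nanosecond field to ns-per-jiffy, which depends on the kernel's HZ.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_UNIT = 100                           # one unit = 100ns
UNITS_PER_SEC = NSEC_PER_SEC // NSEC_PER_UNIT


def timespec_to_100ns(sec, nsec):
    """Pack (sec, nsec) into one integer count of 100ns units.

    Assumption: signed count since the epoch; sub-100ns digits are dropped.
    """
    return sec * UNITS_PER_SEC + nsec // NSEC_PER_UNIT


def timespec_from_100ns(units):
    """Unpack a 100ns count into (sec, nsec) with nsec in [0, 1e9)."""
    sec, rem = divmod(units, UNITS_PER_SEC)   # floor division handles negatives
    return sec, rem * NSEC_PER_UNIT


def fits_in_s64(units):
    """True if the count is representable as a signed 64-bit integer."""
    return -(1 << 63) <= units < (1 << 63)


def truncate_to_jiffy(nsec, hz):
    """Model of truncating nsec to ns-per-jiffy granularity for a given HZ."""
    gran = NSEC_PER_SEC // hz
    return nsec - nsec % gran
```

With these definitions, truncate_to_jiffy(123456789, 1000) gives
123000000 while truncate_to_jiffy(123456789, 250) gives 120000000, so
the presented value changes with HZ. The 100ns encoding always yields
123456700 regardless of HZ, and a signed 64-bit count of 100ns units
covers roughly 29,000 years on either side of its epoch.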

We also must not lose sight of the fact that the lazytime mount
option makes atime updates on XFS behave exactly as the nfsd/NFS
client application wants. That is, XFS will do in-memory atime
updates unless the atime update also sets S_VERSION to explicitly
bump the i_version counter if required. That leads to another
potential nfsd specific solution without requiring filesystems to
change on disk formats: the nfsd explicitly asks operations for lazy
atime updates...
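
The lazytime behaviour described above can be modelled with a toy
Python sketch. This is an illustration under stated assumptions, not
the XFS implementation: ToyInode, update_time() and the
logged_updates counter are hypothetical, and the flag values are
illustrative bit positions only.

```python
S_ATIME = 1 << 0      # illustrative flag bits for this toy model
S_VERSION = 1 << 3


class ToyInode:
    """Hypothetical in-memory inode holding only what the model needs."""

    def __init__(self):
        self.atime = 0
        self.i_version = 0
        self.logged_updates = 0   # updates that would have to be logged


def update_time(inode, now, flags, lazytime=True):
    """Apply a timestamp update and report how it was handled.

    With lazytime, an update that does not carry S_VERSION stays purely
    in memory. An update that bumps i_version (or any update without
    lazytime) is counted as one that must be logged.
    """
    if flags & S_ATIME:
        inode.atime = now
    if flags & S_VERSION:
        inode.i_version += 1
    if lazytime and not (flags & S_VERSION):
        return "in-memory"
    inode.logged_updates += 1
    return "logged"
```

In this model a plain atime update under lazytime leaves both
i_version and logged_updates untouched, which is the property an nfsd
consumer of the change attribute cares about.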

And we must also keep in sight the fact that io_uring wants
non-blocking timestamp updates to be possible (for all types of
updates). Hence it looks to me like we have more than one use case
for per-operation/application specific timestamp update semantics.
Perhaps there's a generic solution to this problem (e.g. operation
specific non-blocking, in-memory pure timestamp updates) that does
what everyone needs...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx