Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

From: Dave Chinner
Date: Mon Oct 23 2023 - 19:26:41 EST

Next message: Ivan Babrou: "Re: wait_for_unix_gc can cause CPU overload for well behaved programs"
Previous message: Nicolin Chen: "Re: [PATCH v3] iommufd: Only enforce cache coherency in iommufd_hw_pagetable_alloc"
In reply to: Jeff Layton: "Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing"
Next in thread: Linus Torvalds: "Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Oct 23, 2023 at 10:45:21AM -0400, Jeff Layton wrote:
> On Mon, 2023-10-23 at 09:17 +1100, Dave Chinner wrote:
> > All I'm suggesting is that rather than using mount options for
> > noatime-like behaviour for NFSD accesses, we actually have the nfsd
> > accesses say "we'd like pure atime updates without iversion, please".
> >
> > Keep in mind that XFS does actually try to avoid bumping i_version
> > on pure timestamp updates - we carved that out a long time ago (see
> > the difference in XFS_ILOG_CORE vs XFS_ILOG_TIMESTAMP in
> > xfs_vn_update_time() and xfs_trans_log_inode()) so that we could
> > optimise fdatasync() to ignore timestamp updates that occur as a
> > result of pure data overwrites.
> >
> > Hence XFS only bumps i_version for pure timestamp updates if the
> > iversion queried flag is set. IOWs, XFS it is actually doing exactly
> > what the VFS iversion implementation is telling it to do with
> > timestamp updates for non-core inode metadata updates.
> >
> > That's the fundamental issue here: nfsd has set VFS state that tells
> > the filesystem to "bump iversion on next persistent inode change",
> > but the nfsd then runs operations that can change non-critical
> > persistent inode state in "query-only" operations. It then expects
> > filesystems to know that it should ignore the iversion queried state
> > within this context. However, without external behavioural control
> > flags, filesystems cannot know that an isolated metadata update has
> > context specific iversion behavioural constraints.
>
> > Hence fixing this is purely a VFS/nfsd i_version implementation
> > problem - if the nfsd is running a querying operation, it should
> > tell the filesystem that it should ignore iversion query state. If
> > nothing the application level cache cares about is being changed
> > during the query operation, it should tell the filesystem to ignore
> > iversion query state because it is likely the nfsd query itself will
> > set it (or have already set it itself in the case of compound
> > operations).
> >
> > This does not need XFS on-disk format changes to fix. This does not
> > need changes to timestamp infrastructure to fix. We just need the
> > nfsd application to tell us that we should ignore the vfs i_version
> > query state when we update non-core inode metadata within query
> > operation contexts.
> >
>
> I think you're missing the point of the problem I'm trying to solve.
> I'm not necessarily trying to guard nfsd against its own accesses. The
> reads that trigger an eventual atime update could come from anywhere --
> nfsd, userland accesses, etc.
>
> If you are serving an XFS filesystem, with the (default) relatime mount
> option, then you are guaranteed that the clients will invalidate their
> cache of a file once per day, assuming that at least one read was issued
> against the file during that day.
>
> That read will cause an eventual atime bump to be logged, at which point
> the change attribute will change. The client will then assume that it
> needs to invalidate its cache when it sees that change.
>
> Changing how nfsd does its own accesses won't fix anything, because the
> problematic atime bump can come from any sort of read access.

I'm not missing the point at all - as I've said in the past I don't
think local vs remote access is in any way relevant to the original
problem that needs to be solved. If the local access is within the
relatime window, it won't cause any persistent metadata change at
all. If it's outside the window, then it's no different to the NFS
client reading data from the server outside the window. If it's the
first access after a NFS client side modification, then it's just
really bad timing but it isn't likely to be a common issue.

Hence I just don't think it matters on bit, and we can address the
24 hour problem separately to the original problem that still needs
to be fixed.

The problem is the first read request after a modification has been
made. That is causing relatime to see mtime > atime and triggering
an atime update. XFS sees this, does an atime update, and in
committing that persistent inode metadata update, it calls
inode_maybe_inc_iversion(force = false) to check if an iversion
update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
i_version and tells XFS to persist it.

IOWs, XFS is doing exactly what the VFS is telling it to do with
i_version during the persistent inode metadata update that the VFS
told it to make.

This, however, is not the semantics that the *nfsd application*
wants. It does not want i_version to be updated when it is running a
data read operation despite the fact the VFS is telling the
filesystem it needs to be updated.

What we need to know is when the inode is being accessed by the nfsd
so we can change the in-memory timestamp update behaviour
appropriately. We really don't need on-disk format changes - we
just need to know that we're supposed to do something special with
pure timestamp updates because i_version needs to be behave in a
manner compatible with the new NFS requirements....

We also don't need generic timestamp infrastructure changes to do
this - the multi-grained timestamp was a neat idea for generic
filesystem support of the nfsd i_version requirements, but it's
collapsed under the weight of complexity.

There are simpler ways individual filesystems can do the right
thing, but to do that we need to know that nfsd has actively
referenced the inode. How we get that information is what I want to
resolve, the filesystem should be able to handle everything else in
memory....

Perhaps we can extract I_VERSION_QUERIED as a proxy for nfsd
activity on the inode rather than need a per-operation context? Is
that going to be reliable enough? Will that cause problems for other
applications that want to use i_version for their own purposes?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

Next message: Ivan Babrou: "Re: wait_for_unix_gc can cause CPU overload for well behaved programs"
Previous message: Nicolin Chen: "Re: [PATCH v3] iommufd: Only enforce cache coherency in iommufd_hw_pagetable_alloc"
In reply to: Jeff Layton: "Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing"
Next in thread: Linus Torvalds: "Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]