Re: [PATCH v4 2/9] fs: add infrastructure for multigrain inode i_m/ctime

From: Jeff Layton
Date: Tue May 23 2023 - 06:40:18 EST


On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot of metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g. backup applications).
> >
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> >
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> >
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> >
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> >
> > This patch adds the infrastructure for this scheme. Filesystems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> >
> > Later patches will convert individual filesystems over to use it.
> >
> > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>
> So there are two things I dislike about this series because I think they
> are fragile:
>
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally uses the value of inode->i_ctime directly, they can get a bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
>

We could do this, but it'll be quite invasive. We'd have to change any
place that touches i_ctime (and there are a lot of them), even on
filesystems that are not being converted.
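
For illustration, here is a rough userspace sketch of what the suggested
rename plus accessors might look like. Nothing here is from the patch
itself: the struct definitions are stand-ins, the names inode_get_ctime()
and inode_set_ctime() are hypothetical, and the value of I_CTIME_QUERIED
is assumed to be bit 30 (the "31st bit" from the description).

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel type; illustration only. */
struct timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Assumption: the flag lives in bit 30 of tv_nsec. */
#define I_CTIME_QUERIED (1L << 30)

/* Mock inode with the field renamed as suggested in the review. */
struct inode {
	struct timespec64 __i_ctime;	/* may carry I_CTIME_QUERIED */
	struct timespec64 i_mtime;
};

/* Hypothetical getter: never leaks the flag bit to the caller. */
static inline struct timespec64 inode_get_ctime(const struct inode *inode)
{
	struct timespec64 ts = inode->__i_ctime;

	ts.tv_nsec &= ~I_CTIME_QUERIED;
	return ts;
}

/* Hypothetical setter: the only place that writes the raw field. */
static inline void inode_set_ctime(struct inode *inode, struct timespec64 ts)
{
	inode->__i_ctime = ts;
}
```

With something like this, a caller that reads the raw field by accident
fails to compile after the rename, which is the robustness argument being
made. The cost is touching every existing user of i_ctime.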

> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
>
> inode->i_mtime = inode->i_ctime = current_time();
>
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
>
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.

AFAICT, under POSIX, you must _always_ set the ctime when you set the
mtime, but the reverse is not true. That's why keeping the flag in the
ctime makes sense. If we're updating the mtime, then we necessarily must
update the ctime.
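
To make the flag handling concrete, a minimal userspace model of the
query/clear cycle on the ctime's tv_nsec word follows. It uses C11 atomics
in place of the kernel's atomic_long_* helpers, the function names are
made up for this sketch, and I_CTIME_QUERIED is again assumed to be bit 30.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSEC_PER_SEC	1000000000L
/* Assumption: the flag is bit 30 of tv_nsec. */
#define I_CTIME_QUERIED	(1L << 30)

/* A normalized tv_nsec is below 10^9 < 2^30, so bit 30 is always free. */
_Static_assert(NSEC_PER_SEC - 1 < I_CTIME_QUERIED,
	       "flag bit must not overlap valid nanosecond values");

/*
 * Model of a getattr-style read: mark the ctime as queried and return
 * the nanoseconds with the flag masked off.
 */
static inline long ctime_nsec_query(atomic_long *nsec)
{
	return atomic_fetch_or(nsec, I_CTIME_QUERIED) & ~I_CTIME_QUERIED;
}

/*
 * Model of the update side: clear the flag and report whether it was
 * set, i.e. whether this update should use a fine-grained timestamp.
 */
static inline bool ctime_test_and_clear_queried(atomic_long *nsec)
{
	long old = atomic_fetch_and(nsec, ~I_CTIME_QUERIED);

	return (old & I_CTIME_QUERIED) != 0;
}
```

Since every mtime update also has to update the ctime, a single flag
consumed on the ctime update path covers both timestamps, provided the
filesystem actually routes the update through that path.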

> Some more nits below.
>
> > +/**
> > + * current_mg_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
>
> The comment should also mention that QUERIED flag is cleared from the ctime.
>

Fair point. I can fix that up.

> > +static struct timespec64 current_mg_time(struct inode *inode)
> > +{
> > + struct timespec64 now;
> > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> > +
> > + if (nsec & I_CTIME_QUERIED) {
> > + ktime_get_real_ts64(&now);
> > + } else {
> > + struct timespec64 ctime;
> > +
> > + ktime_get_coarse_real_ts64(&now);
> > +
> > + /*
> > + * If we've recently fetched a fine-grained timestamp
> > + * then the coarse-grained one may still be earlier than the
> > + * existing one. Just keep the existing ctime if so.
> > + */
> > + ctime = ctime_peek(inode);
> > + if (timespec64_compare(&ctime, &now) > 0)
> > + now = ctime;
> > + }
> > +
> > + return now;
> > +}
> > +
>
> ...
>
> > +/**
> > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime tv_nsec field from the inode, mask off the
> > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> > + * internal consumers of the ctime that aren't concerned with ensuring a
> > + * fine-grained update on the next change (e.g. when preparing to store
> > + * the value in the backing store for later retrieval).
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline long ctime_nsec_peek(const struct inode *inode)
> > +{
> > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
>
> This is somewhat unusual spacing. I'd use:
>
> inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
>

Yeah, I don't know what happened there. I'll fix that up.

> > +}
> > +
> > +/**
> > + * ctime_peek - peek at (but don't query) the ctime
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > + * use by internal consumers that don't require a fine-grained update on
> > + * the next change.
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > +{
> > + struct timespec64 ctime;
> > +
> > + ctime.tv_sec = inode->i_ctime.tv_sec;
> > + ctime.tv_nsec = ctime_nsec_peek(inode);
> > +
> > + return ctime;
> > +}
>
> Given this is in a header that gets included in a lot of places, maybe we
> should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> chances of a name clash?

I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me.
We don't really use that nomenclature elsewhere.

--
Jeff Layton <jlayton@xxxxxxxxxx>