Re: [PATCH 1/4] statx: Add a system call to make enhanced file info available

From: Jeff Layton
Date: Fri Nov 18 2016 - 13:54:35 EST

On Fri, 2016-11-18 at 18:04 +0000, David Howells wrote:
> Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > > We've already been through that. I wanted to call it stx_data_version but
> > > that got argued down to stx_version. The problem is that what the version
> > > number means is entirely filesystem dependent, and it might not just reflect
> > > changes in the data.
> > >
> >
> > It had better not just reflect data changes.
> >
> > knfsd populates the NFSv4 change attribute from inode->i_version. It
> > _must_ have changed between subsequent queries if either the data or
> > metadata has changed (basically whenever you would update either the
> > ctime or the mtime).
> No, I think it *should* just reflect the data changes - otherwise you have
> to burn your cached data unnecessarily.
> > > > So if stx_version is intended to export the internal filesystem
> > > > inode change counter (i.e. inode->i_version) then let's call it that:
> > > > stx_modification_count. It's clear and unambiguous as to what it
> > > > represents, especially as this counter is more than just a "data
> > > > modification" counter - inode metadata modifications will also
> > > > cause it to change....
> > >
> > > I disagree that it's unambiguous. It works like mtime, right?
> >
> > More like ctime + mtime mashed together.
> Isn't ctime updated every time mtime is? In which case stx_change_count would
> be a better name.
> > > Which wouldn't be of use for certain filesystems. An example of this
> > > would be AFS, where it's incremented by 1 each time a write is committed,
> > > but is not updated for metadata changes. This is what matters for data
> > > caching.
> > >
> >
> > No. Basically the rules are that if something in the inode data or
> > metadata changed, then it must be a "larger" value (also accounting for
> > wraparound). So you also need to change it (usually by incrementing it)
> > when doing namespace changes that involve it (renames, unlinks, etc.).
> That's entirely filesystem dependent.

My mistake. I had thought that i_version was only used for NFSv4, and a
few internal callers (particularly, some readdir implementations). I
didn't realize that AFS also uses it.

> A better rule is that if you do a write and then compare the data version you
> got back to the version you had before; if it's increased by exactly one,
> there were no other writes between your last retrieval of the attributes and
> your write that just got committed. Admittedly, this assumes that the server
> serialises writes to a particular file.
> If the value merely increases, this mechanism can't tell you that no other
> write happened, so the version is of limited value.

For the case of NFSv4, you can't infer that anyway. The protocol pretty
much states that the client has to treat this value as semi-opaque. It
can't infer anything other than "something has changed" (though it can
look to see if a change attribute is "old" and discard it).
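
To make the two sets of semantics concrete, here is a rough sketch in plain
C. The helper names are made up for illustration and none of this is kernel,
knfsd or AFS code. The "newer" test assumes a serial-number style comparison
(the same trick the kernel's time_after() uses for jiffies), which only holds
while the two values are less than 2^63 apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* NFSv4-style use: the value is semi-opaque, so only (in)equality
 * between the cached and the fresh value is meaningful. */
bool change_attr_changed(uint64_t cached, uint64_t fresh)
{
	return cached != fresh;
}

/* Wraparound-aware "a is newer than b". Assumption: serial-number
 * comparison, valid only while the values are < 2^63 apart. Relies on
 * the usual two's-complement conversion to int64_t. */
bool change_attr_newer(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* AFS-style use, as described in the quoted text: the data version is
 * bumped by exactly one per committed write, so a delta of one means
 * no other write landed in between. Unsigned subtraction keeps this
 * correct across wraparound. */
bool data_version_only_our_write(uint64_t before, uint64_t after)
{
	return after - before == 1;
}

/* Small usage example; call it from main() to exercise the helpers. */
void change_attr_selfcheck(void)
{
	assert(change_attr_changed(7, 8));
	assert(change_attr_newer(2, UINT64_MAX));      /* wrapped past zero */
	assert(!change_attr_newer(UINT64_MAX, 2));
	assert(data_version_only_our_write(41, 42));
	assert(!data_version_only_our_write(41, 43));  /* someone else wrote too */
}
```

The first two helpers are all an NFSv4-like consumer could safely use; only
the third one infers anything from the actual value.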

Does AFS allow you to infer something from the actual value?

Now that I realize that AFS has very different semantics, we might want
to step back for a bit on presenting i_version to userspace. I think we
need to come to some agreement on what i_version should actually mean
before we expose it to userland to use.

Maybe we should consider separate stx_change_attr and
stx_data_change_attr fields in here?

> > Adding new fields in later piecemeal patches allows us to demonstrate
> > that that concept actually works.
> You're probably right, but the downside is that we really need some way to
> find out what's supported. On the other hand, we probably need that anyway,
> hence my suggestion of an fsinfo() syscall also.

Yeah, I think we will need an fsinfo call of some sort eventually.

Alternately, we could just add the fields of interest to statx so that
the callers can just query for it with the other fields (e.g.
STATX_TS_GRANULARITY). Just document those attributes as being per-mount
or whatever.
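
Roughly what I have in mind, as a userspace-only sketch. Everything here is
hypothetical: the DEMO_* bit values, struct demo_statx and the helpers are
invented for illustration, and I'm assuming the returned structure carries a
result mask saying which of the requested fields the filesystem actually
filled in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; not taken from the patch or any header. */
#define DEMO_STATX_MTIME          0x01u
#define DEMO_STATX_VERSION        0x02u
#define DEMO_STATX_TS_GRANULARITY 0x04u

struct demo_statx {
	uint32_t stx_mask;            /* fields the filesystem filled in */
	uint64_t stx_version;
	uint32_t stx_ts_granularity;  /* nanoseconds; per-mount attribute */
};

/* Stand-in for the filesystem side: only fields that were both
 * requested and supported get filled in and flagged in stx_mask.
 * The values 42 and 1 are arbitrary placeholders. */
void demo_fill(uint32_t request_mask, uint32_t fs_supported,
	       struct demo_statx *out)
{
	assert(out != NULL);
	out->stx_mask = request_mask & fs_supported;
	out->stx_version =
		(out->stx_mask & DEMO_STATX_VERSION) ? 42 : 0;
	out->stx_ts_granularity =
		(out->stx_mask & DEMO_STATX_TS_GRANULARITY) ? 1 : 0;
}

/* Caller side: a field may only be trusted if its bit came back set. */
bool demo_has(const struct demo_statx *stx, uint32_t bit)
{
	return (stx->stx_mask & bit) != 0;
}
```

A caller would ask for DEMO_STATX_TS_GRANULARITY along with whatever else it
wants and check demo_has() before using the field, so no separate "what's
supported" query is needed for that attribute.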

> > > You really think we're going to have accurate timestamps with a resolution
> > > of a millionth of a nanosecond? This means you're going to be doing a
> > > 64-bit division every time you want a nanosecond timestamp.
> >
> > ...
> >
> > Could contemporary machines get away with just shifting down by 32
> > bits?
> A better way would probably be to have:
>
>     struct timestamp {
>         __u64 seconds;
>         __u32 nanoseconds;
>         __u32 femtoseconds;
>     };
>
> where you effectively add all the fields together with appropriate
> multipliers.

Harder for those femtosecond scale machines to deal with, but maybe
you're right.
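
For reference, the layout quoted above could be consumed something like this.
It's a userspace sketch using uint64_t/uint32_t in place of __u64/__u32, and
it assumes the femtoseconds field holds only the remainder below one
nanosecond (0..999999). The helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000u
#define FSEC_PER_NSEC 1000000u

struct timestamp {
	uint64_t seconds;
	uint32_t nanoseconds;   /* 0 .. 999,999,999 */
	uint32_t femtoseconds;  /* 0 .. 999,999 (assumed: sub-ns remainder) */
};

/* Sub-second part in femtoseconds: nanoseconds and femtoseconds added
 * together with the appropriate multiplier. The result is below 1e15,
 * so it fits comfortably in 64 bits. */
uint64_t timestamp_subsec_fsec(const struct timestamp *ts)
{
	assert(ts->nanoseconds < NSEC_PER_SEC);
	assert(ts->femtoseconds < FSEC_PER_NSEC);
	return (uint64_t)ts->nanoseconds * FSEC_PER_NSEC + ts->femtoseconds;
}

/* Order two timestamps field by field: returns <0, 0 or >0. */
int timestamp_cmp(const struct timestamp *a, const struct timestamp *b)
{
	if (a->seconds != b->seconds)
		return a->seconds < b->seconds ? -1 : 1;
	if (a->nanoseconds != b->nanoseconds)
		return a->nanoseconds < b->nanoseconds ? -1 : 1;
	if (a->femtoseconds != b->femtoseconds)
		return a->femtoseconds < b->femtoseconds ? -1 : 1;
	return 0;
}
```

The point of the split is that anyone who only wants nanosecond resolution
reads seconds and nanoseconds directly and ignores the last field, with no
division involved.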

If the plan is to do that then we can just punt that out until that
need arises, and add stx_?time_fsec fields at that point. The ugliness
would all be hidden behind the glibc wrapper anyway (in principle).

> But I still wonder if we really are going to move to femtosecond timestamps,
> given that that's going to involve clock frequencies well in excess of 1 THz
> to be useful. Even attoseconds is probably unnecessary, given that clock
> frequencies don't seem to be moving much beyond a few GHz, though it's
> reasonable that we could have a timestamp counter that has an attosecond
> period - it's just that the processing time to deal with it seems likely to
> render it unnecessary.
> David

True, and we have to figure here that these are _file_ times, so you'd
have to have file updates happening on sub-nanosecond timescales for
this to make sense as well.

I don't think we really need to add this now, especially if we can add
the extra fields if/when the need ever arises.

Jeff Layton <jlayton@xxxxxxxxxx>