Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

From: Jeff Layton
Date: Tue Sep 13 2022 - 15:03:23 EST


On Tue, 2022-09-13 at 11:15 +1000, Dave Chinner wrote:
> On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote:
> > On Mon, 12 Sep 2022, Jeff Layton wrote:
> > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > > > This could be expensive.
> > > >
> > > > There is not currently any locking around O_DIRECT writes. You cannot
> > > > synchronise with them.
> > > >
> > >
> > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> > > inode_lock_shared across the I/O. That was why patch #8 takes the
> > > inode_lock (exclusive) across the getattr.
> >
> > Looking at ext4_dio_write_iter() it certain does take
> > inode_lock_shared() before starting the write and in some cases it
> > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for
> > the write to complete. But not in all cases.
> > So I don't think it always holds the shared lock across all direct IO.
>
> To serialise against dio writes, one must:
>
> // Lock the inode exclusively to block new DIO submissions
> inode_lock(inode);
>
> // Wait for all in flight DIO reads and writes to complete
> inode_dio_wait(inode);
>
> This is how truncate, fallocate, etc serialise against AIO+DIO which
> do not hold the inode lock across the entire IO. These have to
> serialise aginst DIO reads, too, because we can't have IO in
> progress over a range of the file that we are about to free....
>

Thanks, that clarifies a bit.

> > > Taking inode_lock_shared is sufficient to block out buffered and DAX
> > > writes. DIO writes sometimes only take the shared lock (e.g. when the
> > > data is already properly aligned). If we want to ensure the getattr
> > > doesn't run while _any_ writes are running, we'd need the exclusive
> > > lock.
> >
> > But the exclusive lock is bad for scalability.
>
> Serilisation against DIO is -expensive- and -slow-. It's not a
> solution for what is supposed to be a fast unlocked read-only
> operation like statx().
>

Fair enough. I labeled that patch with RFC as I suspected that it would
be too expensive. I don't think we can guarantee perfect consistency vs.
mmap either, so carving out DIO is not a stretch (at least not for
NFSv4).

> > > Maybe that's overkill, though it seems like we could have a race like
> > > this without taking inode_lock across the getattr:
> > >
> > > reader writer
> > > -----------------------------------------------------------------
> > > i_version++
> > > getattr
> > > read
> > > DIO write to backing store
> > >
> >
> > This is why I keep saying that the i_version increment must be after the
> > write, not before it.
>
> Sure, but that ignores the reason why we actually need to bump
> i_version *before* we submit a DIO write. DIO write invalidates the
> page cache over the range of the write, so any racing read will
> re-populate the page cache during the DIO write.
>
> Hence buffered reads can return before the DIO write has completed,
> and the contents of the read can contain, none, some or all of the
> contents of the DIO write. Hence i_version has to be incremented
> before the DIO write is submitted so that racing getattrs will
> indicate that the local caches have been invalidated and that data
> needs to be refetched.
>

Bumping the change attribute after the write is done would be sufficient
for serving NFSv4. The clients just invalidate their caches if they see
the value change. Bumping it before and after would be fine too. We
might get some spurious cache invalidations but they'd be infrequent.

FWIW, we've never guaranteed any real atomicity with NFS readers vs.
writers. Clients may see the intermediate stages of a write from a
different client if their reads race in at the right time. If you need
real atomicity, then you really should be using locking. What we _do_
try to ensure is timely pagecache invalidation when this occurs.

If we want to expose this to userland via statx in the future, then we
may need a stronger guarantee because we can't as easily predict how
people will want to use this.

At that point, bumping i_version both before and after makes a bit more
sense, since it better ensures that a change will be noticed, whether
the related read op comes before or after the statx.

> But, yes, to actually be safe here, we *also* should be bumping
> i_version on DIO write on DIO write completion so that racing
> i_version and data reads that occur *after* the initial i_version
> bump are invalidated immediately.
>
> IOWs, to avoid getattr/read races missing stale data invalidations
> during DIO writes, we really need to bump i_version both _before and
> after_ DIO write submission.
>
> It's corner cases like this where "i_version should only be bumped
> when ctime changes" fails completely. i.e. there are concurrent IO
> situations which can only really be handled correctly by bumping
> i_version whenever either in-memory and/or on-disk persistent data/
> metadata state changes occur.....

I think we have two choices (so far) when it comes to closing the race
window between the i_version bump and the write. Either should be fine
for serving NFSv4.

1/ take the inode_lock in some form across the getattr call for filling
out GETATTR/READDIR/NVERIFY info. This is what the RFC patch in my
latest set does. That's obviously too expensive though. We could take
inode_lock_shared, which wouldn't exclude DIO, but would cover the
buffered and DAX codepaths. This is somewhat ugly though, particularly
with slow backend network filesystems (like NFS). That getattr could
take a while, and meanwhile all writes are stuck...

...or...

2/ start bumping the i_version after a write completes. Bumping it twice
(before and after) would be fine too. In most cases the second one will
be a no-op anyway. We might get the occasional false cache invalidations
there with NFS, but they should be pretty rare and that's preferable to
holding on to invalid cached data (which I think is a danger today).

To do #2, I guess we'd need to add an inode_maybe_inc_iversion call at
the end of the relevant ->write_iter ops, and then dirty the inode if
that comes back true? That should be pretty rare.

We do also still need some way to mitigate potential repeated versions
due to crashes, but that's orthogonal to the above issue (and being
discussed in a different branch of this thread).
--
Jeff Layton <jlayton@xxxxxxxxxx>