Re: xfs: commit 6552321831dc "xfs: remove i_iolock and use i_rwsem in the VFS inode instead" change causes hang
From: James Bottomley
Date: Sun Jan 08 2017 - 13:57:45 EST
On Sun, 2017-01-08 at 19:18 +0100, Christoph Hellwig wrote:
> On Sun, Jan 08, 2017 at 09:59:25AM -0800, James Bottomley wrote:
> > Hey, that's not really true: the inode lock (i_rwsem) is used in
> > all sorts of generic places, including generic_file_write_iter().
> > That's, I think, why ima is using it to try to prevent writes
> > while it measures the file.
>
> But all these are _below_ file_operations. The only place where take
> them in the VFS is for namespace locking, e.g. before calling into
> inode_operations (to generalize a little).
Definitely agree we need an abstraction with defined semantics.
> > > So the answer here is that ima needs to stop playing with
> > > i_rwsem.
> >
> > Isn't there a happy medium? most sensible filesystems will allow
> > shared reading (unless they want to tank performance) so we can
> > rely on the fact that even if a fs does use i_rwsem internally on
> > the read path, it will have to be shared.
>
> At least for direct I/O that doesn't always have to be true.
I'm unsure about the DIO case, so lets try defining the semantics and
see if they're implementable for DIO, otherwise simply exclude it.
> > So simply replacing the inode_lock() in ima
> > with inode_lock_shared() should do what ima wants and not interact
> > badly even if the underlying FS uses i_rwsem. If there's ever a FS
> > that takes it exclusively in the read path, ima can simply
> > blacklist
> > it.
>
> IFF we actually allow recursive readers for rw_semaphores this would
> work around the issue (but I'm not sure about that fact, at least
> in the past we didn't). It won't fix IMA for all the file systems
> use other synchronization for reads, e.g. the cluster locks in ocfs2
> or gfs2. It won't fix NFS which will exhibit exacly the same issue
> as Mimi reported.
>
> Last but not least it won't solve the problem that IMA has never been
> designed and does neither document the requires it has from a file
> system, nor is there any systematic testing for it. It will keep on
> breaking because it has all kinds of weird implicit assumptions never
> written down or verified, and the test coverage for it is basically
> non-existing.
OK, so how about we define it. I think we need two vfs calls:
inode_block_local_writes(inode)
inode_unblock_local_writes(inode)
With semantics that between these two, all write attempts to the file
backed by the inode on this system block but reads of the underlying
file are allowed (I added local so we don't have to implement for
remote filesystems). inode_block_local_writes() will block until all
local writes to the file have finished, so you're guaranteed the file
only allows reads when it succeeds.
As for implementation in the vfs, I suspect an outstanding write count
in the inode might be the better way?
James