Re: [PATCH] vfs: check dentry is still valid in get_link()
From: Brian Foster
Date: Wed Jan 19 2022 - 09:08:33 EST
On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> >
> > > No, that just creates a black hole where the VFS inode has been
> > > destroyed but the XFS inode cache doesn't know it's been trashed.
> > > Hence setting XFS_IRECLAIMABLE needs to remain in the during
> > > ->destroy_inode, otherwise the ->lookup side of the cache will think
> > > that are currently still in use by the VFS and hand them straight
> > > back out without going through the inode recycling code.
> > >
> > > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
> > > part of the inode has been torn down, and that it must go back
> > > through VFS re-initialisation before it can be re-instantiated as a
> > > VFS inode.
> >
> > OK...
> >
> > > It would also mean that the inode will need to go through two RCU
> > > grace periods before it gets reclaimed, because XFS uses RCU
> > > protected inode cache lookups internally (e.g. for clustering dirty
> > > inode writeback) and so freeing the inode from the internal
> > > XFS inode cache requires RCU freeing...
> >
> > Wait a minute. Where is that RCU delay of yours, relative to
> > xfs_vn_unlink() and xfs_vn_rename() (for target)?
>
> Both of those drop the inode on an on-disk unlinked list. When the
> last reference goes away, ->destroy_inode then runs inactivation.
>
> Inactivation then runs transactions to free all the space attached
> to the inode and then removes the inode from the unlinked list and
> frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty
> in memory. It can't be reclaimed until the inode is written to disk
> or the whole inode cluster is freed and the inode marked XFS_ISTALE
> (so won't get written back).
>
> At that point, a background inode reclaim thread (runs every 5s)
> does a RCU protected lockless radix tree walk to find
> XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it
> moves them to XFS_IRECLAIM state, deletes them from the radix tree
> and frees them via a call_rcu() callback.
>
> If memory reclaim comes along sooner than this, the
> ->free_cached_objects() superblock shrinker callback runs that RCU
> protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes.
>
> > And where does
> > it happen in case of e.g. open() + unlink() + close()?
>
> Same thing - close() drops the last reference, the unlinked inode
> goes through inactivation, then moves into the XFS_IRECLAIMABLE
> state.
>
> The problem is not -quite- open-unlink-close. The problem case is
> the reallocation of an on-disk inode in the case of
> unlink-close-open(O_CREATE) operations because of the on-disk inode
> allocator policy of aggressive reuse of recently freed inodes. In
> that case the xfs_iget() lookup will reinstantiate the inode via
> xfs_iget_recycle() and the inode will change identity between VFS
> instantiations.
>
> This is where a RCU grace period is absolutely required, and we
> don't currently have one. The bug was introduced with RCU freeing of
> inodes (what, 15 years ago now?) and it's only recently that we've
> realised this bug exists via code inspection. We really have no
> evidence that it's actually been tripped over in the wild....
>
To be fair, we have multiple reports of the NULL ->get_link() variant
and my tests to this point to induce "unexpected" returns of that
function don't manifest in as catastrophic a side effect, so might not
be as immediately noticeable by users. I.e., returning non-string data
doesn't seem to necessarily cause a crash in the vfs and the symlink to
symlink variant is more of an unexpected redirection of a lookup.
IOW, I think it's fairly logical to assume that if users are hitting the
originally reported problem, they're likely dangerously close to the
subsequent problems that have been identified from further inspection of
the related code. I don't think this is purely a case of a "theoretical"
problem that doesn't warrant some form of prioritized fix.
> Unfortunately, the simple fix of adding syncronize_rcu() to
> xfs_iget_recycle() causes significant performance regressions
> because we hit this path quite frequently when workloads use lots of
> temporary files - the on-disk inode allocator policy tends towards
> aggressive re-use of inodes for small sets of temporary files.
>
> The problem XFS is trying to address is that the VFS inode lifecycle
> does not cater for filesystems that need to both dirty and then
> clean unlinked inodes between iput_final() and ->destroy_inode. It's
> too late to be able to put the inode back on the LRU once we've
> decided to drop the inode if we need to dirty it again. ANd because
> evict() is part of the non-blocking memory reclaim, we aren't
> supposed to block for arbitrarily long periods of time or create
> unbound memory demand processing inode eviction (both of which XFS
> can do in inactivation).
>
> IOWs, XFS can't free the inode until it's journal releases the
> internal reference on the dirty inode. ext4 doesn't track inodes in
> it's journal - it only tracks inode buffers that contain the changes
> made to the inode, so once the transaction is committed in
> ext4_evict_inode() the inode can be immediately freed via either
> ->destroy_inode or ->free_inode. That option does not exist for XFS
> because we have to wait for the journal to finish with the inode
> before it can be freed. Hence all the background reclaim stuff.
>
> We've recently solved several of the problems we need to solve to
> reduce the mismatch; avoiding blocking on inode writeback in reclaim
> and background inactivation are two of the major pieces of work we
> needed done before we could even consider more closely aligning XFS
> to the VFS inode cache life cycle model.
>
The background inactivation work facilitates an incremental improvement
by nature because destroyed inodes go directly to a queue instead of
being processed synchronously. My most recent test to stamp the grace
period info at inode destroy time and conditionally sync at reuse time
shows pretty much no major cost because the common case is that a grace
period has already expired by the time the queue populates, is processed
and said inodes become reclaimable and reallocated. To go beyond just
the performance result, if I open code the conditional sync for tracking
purposes I only see something like 10-15 rcu waits out of the 36k
allocation cycles. If I increase the background workload 4x, the
allocation rate drops to ~33k cycles (which is still pretty much in line
with baseline) and the rcu sync count increases to 70, which again is
relatively nominal over tens of thousands of cycles.
This all requires some more thorough testing, but I'm sure it won't be
absolutely free for every possible workload or environment. But given
that we know this infrastructure is fundamentally broken (by subtle
compatibilities between XFS and the VFS that have evolved over time),
will require some thought and time to fix properly in the filesystem,
that users are running into problems very closely related to it, why not
try to address the fundamental breakage if we can do so with an isolated
change with minimal (but probably not zero) performance impact?
I agree that the unconditional synchronize_rcu() on reuse approach is
just not viable, but so far tests using cond_synchronize_rcu() seem
fairly reasonable. Is there some other problem or concern with such an
approach?
Brian
> The next step is to move the background inode inactivation triggers
> up into ->drop_inode so we can catch inodes that need to be dirtied
> by the filesysetm before they have been marked for eviction by the
> VFS. This will allow us to keep the inode on the VFS LRU (probably
> marked with I_WILL_FREE so everyone else keeps away from it) whilst
> we are waiting for the background inactivation work to be done, the
> journal flushed and the metadata written back. Once clean, we can
> directly evict the inode from the VFS ourselves.
>
> This would mean we only get clean, reclaimable inodes hitting the
> evict() path, and so at that point we can just remove the inode
> directly from the XFS inode cache from either ->destroy_inode or
> ->free_inode and RCU free it. The recycling of in-memory inodes in
> xfs_iget_cache_hit can go away entirely because no inodes will
> linger in the XFS inode cache without being visible at the VFS
> layer as they do now...
>
> That's going to take a fair bit of work to realise, and I'm not sure
> yet exactly what mods are going to be needed to either the VFS inode
> infrastructure or the XFS inode cache.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>