Re: [PATCH] vfs: Fix potential circular locking through setxattr() and removexattr()

From: Jan Kara
Date: Wed Jul 31 2024 - 14:17:12 EST


On Mon 29-07-24 17:28:22, Christian Brauner wrote:
> On Wed, Jul 24, 2024 at 03:30:09PM GMT, Jan Kara wrote:
> > On Tue 23-07-24 14:57:46, David Howells wrote:
> > > Then (2) on the other side, you have a read or a write to the network
> > > filesystem through netfslib which may invoke the cache, which may require
> > > cachefiles to check the xattr on the cache file and maybe set/remove it -
> > > which requires the sb_writers lock on the cache filesystem.
> > >
> > > So if ->read_folio(), ->readahead() or ->writepages() can ever be called with
> > > mm->mmap_lock or vma->vm_lock held, netfslib may call down to cachefiles and
> > > ultimately, it should[*] then take the sb_writers lock on the backing
> > > filesystem to perform xattr manipulation.
> >
> > Well, ->read_folio() under mm->mmap_lock is a standard thing to happen in a
> > page fault. Now grabbing sb_writers (of any filesystem) in that path is
> > problematic and can deadlock though:
> >
> > TASK 1: page fault on fs A
> >   grab mm->mmap_lock
> >   filemap_fault()
> >     if (unlikely(!folio_test_uptodate(folio))) {
> >       filemap_read_folio() on fs A
> >         now if you grab sb_writers on fs B:
> >         sb_start_write(fs B)
> >           - blocked behind freeze_super()
> >
> > TASK 2: freeze_super() on fs B
> >   sb->s_writers.frozen = SB_FREEZE_WRITE;
> >   sb_wait_write(sb, SB_FREEZE_WRITE);
> >     - waits for the write in TASK 3
> >
> > TASK 3: write(2) on fs B (a thread sharing the mm with TASK 1)
> >   sb_start_write(fs B)
> >   generic_perform_write()
> >     fault_in_iov_iter_readable()
> >       page fault
> >         grab mm->mmap_lock
> >           - blocked behind TASK 1
> >
> > => deadlock (TASK 1 -> TASK 2 -> TASK 3 -> TASK 1)
> >
> > Now this is not the deadlock your lockdep trace is showing but it is a
> > similar one. Like:
> >
> > TASK 1: filemap_invalidate() on fs A
> >   filemap_invalidate_lock()
> >   filemap_fdatawrite_wbc()
> >     afs_writepages()
> >       cachefiles_issue_write()
> >         sb_start_write() on fs B
> >           => blocks behind freeze_super()
> >
> > TASK 2: freeze_super() on fs B
> >   sb->s_writers.frozen = SB_FREEZE_WRITE;
> >   sb_wait_write(sb, SB_FREEZE_WRITE);
> >     => blocks behind write(2)
> >
> > TASK 3: page fault on fs A
> >   lock mm->mmap_lock
> >   filemap_fault()
> >     filemap_invalidate_lock_shared()
> >       => blocks behind filemap_invalidate()
> >
> > TASK 4: write(2) on fs B (a thread sharing the mm with TASK 3)
> >   sb_start_write(fs B)
> >   generic_perform_write()
> >     fault_in_iov_iter_readable()
> >       page fault
> >         grab mm->mmap_lock
> >           => blocks behind the page fault in TASK 3
> >
> > => deadlock (TASK 1 -> TASK 2 -> TASK 4 -> TASK 3 -> TASK 1)
> >
> > So I still maintain that grabbing sb_start_write() from quite deep within
> > the locking hierarchy (like from writepages with pages locked, but even
> > holding the invalidate_lock is enough to cause problems) is problematic
> > and prone to deadlocks.
> >
> > > [*] I say "should" because at the moment cachefiles calls vfs_set/removexattr
> > > functions which *don't* take this lock (which is a bug). Is this an error
> > > on the part of vfs_set/removexattr()? Should they take this lock
> > > analogously to vfs_truncate() and vfs_iocb_iter_write()?
> > >
> > > However, as it doesn't, it manages to construct a locking chain via the
> > > mapping.invalidate_lock, the afs vnode->validate_lock and something in execve
> > > that I don't exactly follow.
> > >
> > >
> > > I wonder if this might be deadlockable by a multithreaded process (ie. so
> > > they share the mm locks) where one thread is writing to a cached file whilst
> > > another thread is trying to set/remove the xattr on that file.
> >
> > Yep, see above. Now the hard question is how to fix this because what you
> > are doing seems to be inherent in how cachefiles fs is designed to work.
>
> So one idea may be to create a private mount for cachefiles, claim write
> access to it when that private mount is created, and retain that write
> access for as long as cachefiles is running. See my draft patch below.

OK, that would deal with one problem, but the fundamental knot in this
tangle is mmap_lock. I don't see the exact call stack that gets us to the
xattr code but I can see e.g.:

filemap_invalidate() - grabs invalidate_lock on AFS inode
  afs_writepages
    netfs_writepages
      netfs_write_folio
        cachefiles_issue_write
          vfs_fallocate()
            - grabs i_rwsem on backing fs inode
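
To make the ordering concrete, here is a minimal, hypothetical sketch (not
the actual cachefiles code, the helper name is made up) of the pattern at
the bottom of that chain: the cache write path is entered with netfs-side
locks already held and then calls vfs_fallocate() on the backing file,
whose ->fallocate() implementation takes i_rwsem on the backing inode:

#include <linux/fs.h>
#include <linux/falloc.h>

/*
 * Hypothetical helper, roughly what the cache write path does: the caller
 * is in ->writepages() of the netfs inode, so it already holds that
 * mapping's invalidate_lock (and has folios locked).
 */
static int cache_prepare_write_range(struct file *backing_file,
                                     loff_t pos, loff_t len)
{
        /*
         * vfs_fallocate() ends up in the backing filesystem's ->fallocate(),
         * which takes inode_lock() (i_rwsem) on the backing inode - i.e. a
         * backing-fs lock is acquired while nested inside netfs-side locks.
         */
        return vfs_fallocate(backing_file, 0, pos, len);
}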

So this call chain again takes locks out of order. That on its own would be
no problem because these are locks on different filesystems (AFS vs the
backing fs), but if you have a process with two threads, one taking a page
fault on AFS and another doing a write(2) to the backing fs, their shared
mmap_lock ties the locking hierarchies of these two filesystems together.
Like:

TASK 1: filemap_invalidate() on fs A
  filemap_invalidate_lock()
  filemap_fdatawrite_wbc()
    afs_writepages()
      cachefiles_issue_write()
        vfs_fallocate() on fs B
          inode_lock(inode on fs B)
            => blocks behind write(2)

TASK 2: page fault on fs A
  lock mm->mmap_lock
  filemap_fault()
    filemap_invalidate_lock_shared()
      => blocks behind filemap_invalidate()

TASK 3: write(2) on fs B (a thread sharing the mm with TASK 2)
  inode_lock(inode on fs B)
  generic_perform_write()
    fault_in_iov_iter_readable()
      page fault
        grab mm->mmap_lock
          => blocks behind page fault

=> deadlock (TASK 1 -> TASK 3 -> TASK 2 -> TASK 1)

So I don't think there's an easy way to completely avoid these deadlocks.
Realistically, I don't think a process taking a page fault on the upper fs
while doing a write on the lower fs is all that common, but it's certainly
a DoS vector. It's also annoying because of the lockdep splats in testing
and the resulting inability to find other locking issues (as lockdep gets
disabled after the first splat).
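
For illustration, the userspace part of such a scenario is just a process
with two threads sharing one mm, roughly like the hypothetical sketch below
(paths and sizes are made up); the third participant, the invalidation /
writeback on fs A that ends up in cachefiles, runs in the kernel:

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

static char *netfs_map;         /* mapping of a file on fs A (the netfs) */
static int backing_fd;          /* file on fs B (the cache / backing fs) */

/* Thread 1: page fault on fs A - takes mm->mmap_lock, then fs A locks. */
static void *fault_on_netfs(void *arg)
{
        return (void *)(long)netfs_map[0];
}

/* Thread 2: write(2) on fs B from a not-yet-faulted buffer, so
 * generic_perform_write() has to fault the source in under mm->mmap_lock
 * (fault_in_iov_iter_readable()) while holding fs B locks. */
static void *write_to_backing(void *arg)
{
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        write(backing_fd, buf, 4096);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        int netfs_fd = open("/afs/cell/file", O_RDONLY);     /* made up */

        backing_fd = open("/cache/backing-file", O_WRONLY);  /* made up */
        netfs_map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, netfs_fd, 0);

        pthread_create(&t1, NULL, fault_on_netfs, NULL);
        pthread_create(&t2, NULL, write_to_backing, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

Nothing in there is exotic, which is what makes it a plausible DoS vector.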

To fix this, either we'd have to keep the lower cache filesystem private to
cachefiles (but I don't think that works with the use cases) or we'd have
to somehow untangle this mmap_lock knot. This is not the first time the
"page fault does quite a bit of fs locking under mmap_lock" problem has
caused filesystems headaches. I would *love* to be able to always drop
mmap_lock in the page fault handler, fill the data into the page cache and
then retry the fault (so that filemap_map_pages() would handle the retried
fault without filesystem involvement). It would make many things in
filesystem locking simpler. As far as I can tell, there are not that many
places left that could not handle dropping of mmap_lock during a fault
(traditionally the problem is with get_user_pages() / pin_user_pages()
users). So maybe this dream would be feasible after all.
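
To sketch what I mean (this is only an illustration of the idea, not real
or complete kernel code - the retry protocol, folio locking and reference
counting are all glossed over), the fault handler would look roughly like:

static vm_fault_t dream_filemap_fault(struct vm_fault *vmf)
{
        struct address_space *mapping = vmf->vma->vm_file->f_mapping;
        struct folio *folio;

        folio = filemap_get_folio(mapping, vmf->pgoff);
        if (!IS_ERR(folio)) {
                if (folio_test_uptodate(folio)) {
                        /* Fast path: data already cached, map it under
                         * mmap_lock without calling into the filesystem. */
                        vmf->page = folio_file_page(folio, vmf->pgoff);
                        return VM_FAULT_LOCKED; /* folio locking elided */
                }
                folio_put(folio);
        }

        /* Slow path: drop mmap_lock *before* calling into the filesystem,
         * so ->read_folio()/->readahead() can take whatever fs locks they
         * need without nesting them under mmap_lock. */
        mmap_read_unlock(vmf->vma->vm_mm);

        /* ... read the data into the page cache here ... */

        /* Have the caller retry the fault; the retry hits the fast path. */
        return VM_FAULT_RETRY;
}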

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR