Re: [PATCH] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim

From: Waiman Long
Date: Thu Jan 02 2020 - 13:41:32 EST

On 1/2/20 1:24 PM, Darrick J. Wong wrote:
> On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote:
>>> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@xxxxxxxxxx> wrote:
>>> Depending on the workloads, the following circular locking dependency
>>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>>> lock) may show up:
>>> ======================================================
>>> WARNING: possible circular locking dependency detected
>>> 5.0.0-rc1+ #60 Tainted: G W
>>> ------------------------------------------------------
>>> fsfreeze/4346 is trying to acquire lock:
>>> 0000000026f1d784 (fs_reclaim){+.+.}, at:
>>> fs_reclaim_acquire.part.19+0x5/0x30
>>> but task is already holding lock:
>>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>> which lock already depends on the new lock.
>>> :
>>> Possible unsafe locking scenario:
>>> CPU0 CPU1
>>> ---- ----
>>> lock(sb_internal);
>>> lock(fs_reclaim);
>>> lock(sb_internal);
>>> lock(fs_reclaim);
>>> *** DEADLOCK ***
>>> 4 locks held by fsfreeze/4346:
>>> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>>> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>>> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>>> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>> stack backtrace:
>>> Call Trace:
>>> dump_stack+0xe0/0x19a
>>> print_circular_bug.isra.10.cold.34+0x2f4/0x435
>>> check_prev_add.constprop.19+0xca1/0x15f0
>>> validate_chain.isra.14+0x11af/0x3b50
>>> __lock_acquire+0x728/0x1200
>>> lock_acquire+0x269/0x5a0
>>> fs_reclaim_acquire.part.19+0x29/0x30
>>> fs_reclaim_acquire+0x19/0x20
>>> kmem_cache_alloc+0x3e/0x3f0
>>> kmem_zone_alloc+0x79/0x150
>>> xfs_trans_alloc+0xfa/0x9d0
>>> xfs_sync_sb+0x86/0x170
>>> xfs_log_sbcount+0x10f/0x140
>>> xfs_quiesce_attr+0x134/0x270
>>> xfs_fs_freeze+0x4a/0x70
>>> freeze_super+0x1af/0x290
>>> do_vfs_ioctl+0xedc/0x16c0
>>> ksys_ioctl+0x41/0x80
>>> __x64_sys_ioctl+0x73/0xa9
>>> do_syscall_64+0x18f/0xd23
>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> According to Dave Chinner:
>>> Freezing the filesystem, after all the data has been cleaned. IOWs
>>> memory reclaim will never run the above writeback path when
>>> the freeze process is trying to allocate a transaction here because
>>> there are no dirty data pages in the filesystem at this point.
>>> Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>>> it /doesn't deadlock/ by taking freeze references for the
>>> transaction. We've just drained all the transactions
>>> in progress and written back all the dirty metadata, too, and so the
>>> filesystem is completely clean and only needs the superblock to be
>>> updated to complete the freeze process. And to do that, it does not
>>> take a freeze reference because calling sb_start_intwrite() here
>>> would deadlock.
>>> IOWs, this is a false positive, caused by the fact that
>>> xfs_trans_alloc() is called from both above and below memory reclaim
>>> as well as within /every level/ of freeze processing. Lockdep is
>>> unable to describe the staged flush logic in the freeze process that
>>> prevents deadlocks from occurring, and hence we will pretty much
>>> always see false positives in the freeze path....
>>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
>>> may fix the issue. However, that will greatly complicate the logic and
>>> may not be worth it.
>>> Another way to fix it is to disable the taking of the fs_reclaim
>>> pseudo lock when in the freezing code path as a reclaim on the freezed
>>> filesystem is not possible as stated above. This patch takes this
>>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
>>> allocation calls when the filesystem has been freezed.
>>> Without this patch, the command sequence below will show that the lock
>>> dependency chain sb_internal -> fs_reclaim exists.
>>> # fsfreeze -f /home
>>> # fsfreeze --unfreeze /home
>>> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
>>> After applying the patch, such sb_internal -> fs_reclaim lock dependency
>>> chain can no longer be found. Because of that, the locking dependency
>>> warning will not be shown.
>> There was an attempt to fix this in the past, but Dave rejected right
>> away for any workaround in xfs and insisted to make lockdep smarter
>> instead. No sure your approach will make any difference this time.
>> Good luck.
> /me wonders if you can fix this by having the freeze path call
> memalloc_nofs_save() since we probably don't want to be recursing into
> the fs for reclaim while freezing it? Probably not, because that's a
> bigger hammer than we really need here. We can certainly steal memory
> from other filesystems that aren't frozen.
> It doesn't solve the key issue that lockdep isn't smart enough to know
> that we can't recurse into the fs that's being frozen and therefore
> there's no chance of deadlock.

Lockdep only looks at all the possible locking chains to see if a
circular deadlock is possible. It doesn't have the smart to understand
filesystem internals. The problem here is caused by the fact that
fs_reclaim is a global pseudo lock that is acquired whenever there is a
chance that FS reclaim can happen. As I said in the commit log, it may
be possible to fix that by breaking up fs_reclaim into a set of
per-filesystem pseudo locks, but that will add quite a bit of complexity
to the code. That is why I don't want to go this route. This patch is
the least invasive that I can think of to address the problem without
inhibiting other valid lockdep checking.