Re: shared/298 lockdep splat?

From: Byungchul Park
Date: Thu Sep 21 2017 - 04:47:28 EST


On Thu, Sep 21, 2017 at 08:22:56AM +1000, Dave Chinner wrote:
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.14.0-rc1-fixes #1 Tainted: G W
> > ------------------------------------------------------
> > loop0/31693 is trying to acquire lock:
> > (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
> >
> > but now in release context of a crosslock acquired at the following:
> > ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #2 ((complete)&ret.event){+.+.}:
> > lock_acquire+0xab/0x200
> > wait_for_completion_io+0x4e/0x1a0
> > submit_bio_wait+0x7f/0xb0
> > blkdev_issue_zeroout+0x71/0xa0
> > xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> > xfs_bmapi_write+0x374/0x11f0 [xfs]
> > xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> > xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> > iomap_apply+0x43/0xe0
> > dax_iomap_rw+0x89/0xf0
> > xfs_file_dax_write+0xcc/0x220 [xfs]
> > xfs_file_write_iter+0xf0/0x130 [xfs]
> > __vfs_write+0xd9/0x150
> > vfs_write+0xc8/0x1c0
> > SyS_write+0x45/0xa0
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> This is on the lower filesystem - it's a write() syscall on a DAX
> enable filesystem. Locking in this case is:
>
> Context function lock
> Lower filesystem xfs_file_dax_write XFS_IOLOCK_EXCL
> Lower filesystem xfs_file_iomap_begin XFS_ILOCK_EXCL
> Lower filesystem/
> lower blkdev submit_bio_wait wait for dax block
> device IO completion
>
> Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion
>
> In the lower filesystem.
>
> > -> #1 (&xfs_nondir_ilock_class){++++}:
> > lock_acquire+0xab/0x200
> > down_write_nested+0x4a/0xb0
> > xfs_ilock+0x263/0x330 [xfs]
> > xfs_setattr_size+0x152/0x370 [xfs]
> > xfs_vn_setattr+0x6b/0x90 [xfs]
> > notify_change+0x27d/0x3f0
> > do_truncate+0x5b/0x90
> > path_openat+0x237/0xa90
> > do_filp_open+0x8a/0xf0
> > do_sys_open+0x11c/0x1f0
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> Context function lock
> Lower filesystem xfs_vn_setattr XFS_IOLOCK_EXCL
> Lower filesystem xfs_vn_setattr XFS_MMAPLOCK_EXCL
> Lower filesystem xfs_setattr_size XFS_ILOCK_EXCL
>
> Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class
>
> i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion
>
> In the lower filesystem.
>
> >
> > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> > up_write+0x1c/0x40
> > xfs_iunlock+0x1d0/0x310 [xfs]
> > xfs_file_fallocate+0x8a/0x310 [xfs]
> > loop_queue_work+0xb7/0x8d0
> > kthread_worker_fn+0xb9/0x1f0
>
> Context function lock
> Lower filesystem xfs_file_fallocate XFS_IOLOCK_EXCL
> Lower filesystem xfs_file_fallocate XFS_MMAPLOCK_EXCL
> [Lower filesystem whatever falloc calls XFS_ILOCK_EXCL]
> Upper filesystem/
> Upper blkd loop_queue_work blk_mq_complete_request
>
> This is clearer when the reported stack is looked at:
>
> > lock_commit_crosslock+0x3e9/0x5c0
> > complete+0x24/0x60
> > blk_update_request+0xc2/0x3e0
> > blk_mq_end_request+0x18/0x80
> > __blk_mq_complete_request+0x9f/0x170
> > loop_queue_work+0x51/0x8d0
> > kthread_worker_fn+0xb9/0x1f0
> > kthread+0x148/0x180
> > ret_from_fork+0x2a/0x40
>
> So this is reporting as:
>
> > Chain exists of:
> > &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
> >
> > Possible unsafe locking scenario by crosslock:
> >
> > CPU0 CPU1
> > ---- ----
> > lock(&xfs_nondir_ilock_class);
> > lock((complete)&ret.event);
> > lock(&(&ip->i_mmaplock)->mr_lock);
> > unlock((complete)&ret.event);
> >
> > *** DEADLOCK ***
>
> Which effectively means:
>
> IO completion -> XFS_MMAPLOCK_EXCL .....
>
> The thing is, this IO completion has nothing to do with the lower
> filesystem - it's the IO completion for the filesystem on the loop
> device (the upper filesystem) and is not in any way related to the
> IO completion from the dax device the lower filesystem is waiting
> on.
>
> IOWs, this is a false positive.
>
> Peter, this is the sort of false positive I mentioned were likely to
> occur without some serious work to annotate the IO stack to prevent
> them. We can nest multiple layers of IO completions and locking in
> the IO stack via things like loop and RAID devices. They can be
> nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
> dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
> checking is going to be a source of false positives until there is
> an effective (and simple!) way of providing context based completion
> annotations to avoid them...

Hello,

It looks caused by that &ret.event in submit_bio_wait() is initialized
with the same class for all layers. I mean that completion variables in
different layers should be initialized with different classes, as you do
for typical locks in xfs.

I am not sure if I understand how xfs works correctly. Right? If yes,
how can we distinguish between independent 'bio's in submit_bio_wait()?
You or I can make it work with the answer. No?

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx