Re: shared/298 lockdep splat?

From: Dave Chinner
Date: Wed Sep 20 2017 - 18:23:06 EST


[cc lkml, PeterZ and Byungchul]

On Wed, Sep 20, 2017 at 02:10:42PM -0700, Darrick J. Wong wrote:
> Hello list,
>
> Yesterday I tried setting up qemu 2.10 with some (fake) nvdimms backed
> by an on-disk file. Midway through a -g auto xfstests run, shared/298
> produced the attached dmesg spew. I'll try to have a look later, but
> in the meantime I'm doing the 'complain to list, see if anyone bites'
> thing. :)
>
> The kernel is 4.14-rc1 without any patches applied.

tl;dr: it's a false positive because it's conflating the independent
locking/completion contexts of upper and lower filesystems when
the loop device is in use.

I've mangled the report in replying - Darrick posted it here, too:

https://pastebin.com/GshLpH7x

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
> (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
> lock_acquire+0xab/0x200
> wait_for_completion_io+0x4e/0x1a0
> submit_bio_wait+0x7f/0xb0
> blkdev_issue_zeroout+0x71/0xa0
> xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> xfs_bmapi_write+0x374/0x11f0 [xfs]
> xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> iomap_apply+0x43/0xe0
> dax_iomap_rw+0x89/0xf0
> xfs_file_dax_write+0xcc/0x220 [xfs]
> xfs_file_write_iter+0xf0/0x130 [xfs]
> __vfs_write+0xd9/0x150
> vfs_write+0xc8/0x1c0
> SyS_write+0x45/0xa0
> entry_SYSCALL_64_fastpath+0x1f/0xbe

This is on the lower filesystem - it's a write() syscall on a DAX
enable filesystem. Locking in this case is:

Context function lock
Lower filesystem xfs_file_dax_write XFS_IOLOCK_EXCL
Lower filesystem xfs_file_iomap_begin XFS_ILOCK_EXCL
Lower filesystem/
lower blkdev submit_bio_wait wait for dax block
device IO completion

Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

> -> #1 (&xfs_nondir_ilock_class){++++}:
> lock_acquire+0xab/0x200
> down_write_nested+0x4a/0xb0
> xfs_ilock+0x263/0x330 [xfs]
> xfs_setattr_size+0x152/0x370 [xfs]
> xfs_vn_setattr+0x6b/0x90 [xfs]
> notify_change+0x27d/0x3f0
> do_truncate+0x5b/0x90
> path_openat+0x237/0xa90
> do_filp_open+0x8a/0xf0
> do_sys_open+0x11c/0x1f0
> entry_SYSCALL_64_fastpath+0x1f/0xbe

Context function lock
Lower filesystem xfs_vn_setattr XFS_IOLOCK_EXCL
Lower filesystem xfs_vn_setattr XFS_MMAPLOCK_EXCL
Lower filesystem xfs_setattr_size XFS_ILOCK_EXCL

Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class

i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> up_write+0x1c/0x40
> xfs_iunlock+0x1d0/0x310 [xfs]
> xfs_file_fallocate+0x8a/0x310 [xfs]
> loop_queue_work+0xb7/0x8d0
> kthread_worker_fn+0xb9/0x1f0

Context function lock
Lower filesystem xfs_file_fallocate XFS_IOLOCK_EXCL
Lower filesystem xfs_file_fallocate XFS_MMAPLOCK_EXCL
[Lower filesystem whatever falloc calls XFS_ILOCK_EXCL]
Upper filesystem/
Upper blkd loop_queue_work blk_mq_complete_request

This is clearer when the reported stack is looked at:

> lock_commit_crosslock+0x3e9/0x5c0
> complete+0x24/0x60
> blk_update_request+0xc2/0x3e0
> blk_mq_end_request+0x18/0x80
> __blk_mq_complete_request+0x9f/0x170
> loop_queue_work+0x51/0x8d0
> kthread_worker_fn+0xb9/0x1f0
> kthread+0x148/0x180
> ret_from_fork+0x2a/0x40

So this is reporting as:

> Chain exists of:
> &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
> Possible unsafe locking scenario by crosslock:
>
> CPU0 CPU1
> ---- ----
> lock(&xfs_nondir_ilock_class);
> lock((complete)&ret.event);
> lock(&(&ip->i_mmaplock)->mr_lock);
> unlock((complete)&ret.event);
>
> *** DEADLOCK ***

Which effectively means:

IO completion -> XFS_MMAPLOCK_EXCL .....

The thing is, this IO completion has nothing to do with the lower
filesystem - it's the IO completion for the filesystem on the loop
device (the upper filesystem) and is not in any way related to the
IO completion from the dax device the lower filesystem is waiting
on.

IOWs, this is a false positive.

Peter, this is the sort of false positive I mentioned were likely to
occur without some serious work to annotate the IO stack to prevent
them. We can nest multiple layers of IO completions and locking in
the IO stack via things like loop and RAID devices. They can be
nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
checking is going to be a source of false positives until there is
an effective (and simple!) way of providing context based completion
annotations to avoid them...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx