Re: [PATCH v3 1/3] lockdep: Make LOCKDEP_CROSSRELEASE configs all part of PROVE_LOCKING

From: Dave Chinner
Date: Tue Aug 22 2017 - 01:46:13 EST


On Mon, Aug 21, 2017 at 05:46:00PM +0200, Peter Zijlstra wrote:
>
>
> Booting the very latest -tip on my test machine gets me the below splat.
>
> Dave, TJ, FYI, lockdep grew annotations for completions; it remembers
> which locks were taken before we complete() and checks none of those are
> held while we wait_for_completion().
>
> That is, something like:
>
> mutex_lock(&A);
> mutex_unlock(&A);
> complete(&C);
>
> mutex_lock(&A);
> wait_for_completion(&C);

Ok, that seems reasonable...

> Is now considered a problem. Note that in order for C to link to A it
> needs to have observed the complete() thread acquiring it after a
> wait_for_completion(), something like:
>
> wait_for_completion(&C)
> mutex_lock(&A);
> mutex_unlock(&A);
> complete(&C);

Sure.

>
> That is, only locks observed taken between C's wait_for_completion() and
> its complete() are considered.
>
> Now given the above observance rule and the fact that the below report
> is from the complete, the thing that happened appears to be:
>
>
> lockdep_map_acquire(&work->lockdep_map)
> down_write(&A)
>
> down_write(&A)
> wait_for_completion(&C)
>
> lockdep_map_acquire(&work->lockdep_map);
> complete(&C)
>
> Which lockdep then puked over because both sides saw the same work
> class.

Let me get this straight, first. If we:

1. take a lock in a work queue context; and
2. in a separate context, hold that lock while we wait for a
completion; and
3. run the completion from the original workqueue where we
/once/ held that lock

lockdep will barf?
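
i.e. boiled right down, the pattern is something like this (a minimal
sketch with made-up names, nothing to do with the actual XFS code):

#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/completion.h>

static DEFINE_MUTEX(A);
static DECLARE_COMPLETION(C);

/* 1 + 3: workqueue context takes the lock, then signals the completion */
static void a_work_fn(struct work_struct *work)
{
	mutex_lock(&A);
	mutex_unlock(&A);
	complete(&C);
}
static DECLARE_WORK(a_work, a_work_fn);

/* 2: another context holds the lock while it waits for the completion */
static void a_waiter(void)
{
	queue_work(system_wq, &a_work);
	mutex_lock(&A);
	wait_for_completion(&C);	/* holding A across this wait is what
					   lockdep now objects to */
	mutex_unlock(&A);
}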

IIUC, that's a problem because XFS does this all over the place.
Let me pull the trace apart in reverse order to explain why XFS is
going to throw endless false positives on this:

> [ 39.159708] -> #0 ((&ioend->io_work)){+.+.}:
> [ 39.166126] process_one_work+0x244/0x6b0
> [ 39.171184] worker_thread+0x48/0x3f0
> [ 39.175845] kthread+0x147/0x180
> [ 39.180020] ret_from_fork+0x2a/0x40
> [ 39.184593] 0xffffffffffffffff

That's a data IO completion, which will take an inode lock if we
have to run a transaction to update inode size or convert an
unwritten extent.

> [ 39.100611] -> #1 (&xfs_nondir_ilock_class){++++}:
> [ 39.107612] __lock_acquire+0x10a1/0x1100
> [ 39.112660] lock_acquire+0xea/0x1f0
> [ 39.117224] down_write_nested+0x26/0x60
> [ 39.122184] xfs_ilock+0x166/0x220
> [ 39.126563] __xfs_setfilesize+0x30/0x200
> [ 39.131612] xfs_setfilesize_ioend+0x7f/0xb0
> [ 39.136958] xfs_end_io+0x49/0xf0
> [ 39.141240] process_one_work+0x273/0x6b0
> [ 39.146288] worker_thread+0x48/0x3f0
> [ 39.150960] kthread+0x147/0x180
> [ 39.155146] ret_from_fork+0x2a/0x40

That's the data IO completion, running from a workqueue, locking the
inode inside a transaction to update the inode size.
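
Paraphrasing from memory (this is not the literal code), the work
function that process_one_work() ran here boils down to:

/* data IO completion, run from a workqueue */
STATIC void
xfs_end_io(
	struct work_struct	*work)
{
	struct xfs_ioend	*ioend =
			container_of(work, struct xfs_ioend, io_work);
	struct xfs_inode	*ip = XFS_I(ioend->io_inode);

	/*
	 * If the write changed the file size or covered an unwritten
	 * extent, we need a transaction, and that takes the inode lock:
	 */
	xfs_ilock(ip, XFS_ILOCK_EXCL);	/* the &xfs_nondir_ilock_class above */
	/* ... join the inode to the transaction, log the change, commit ... */
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
}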


> [ 38.962397] -> #2 ((complete)&bp->b_iowait){+.+.}:
> [ 38.969401] __lock_acquire+0x10a1/0x1100
> [ 38.974460] lock_acquire+0xea/0x1f0
> [ 38.979036] wait_for_completion+0x3b/0x130
> [ 38.984285] xfs_buf_submit_wait+0xb2/0x590
> [ 38.989535] _xfs_buf_read+0x23/0x30
> [ 38.994108] xfs_buf_read_map+0x14f/0x320
> [ 38.999169] xfs_trans_read_buf_map+0x170/0x5d0
> [ 39.004812] xfs_read_agf+0x97/0x1d0
> [ 39.009386] xfs_alloc_read_agf+0x60/0x240
> [ 39.014541] xfs_alloc_fix_freelist+0x32a/0x3d0
> [ 39.020180] xfs_free_extent_fix_freelist+0x6b/0xa0
> [ 39.026206] xfs_free_extent+0x48/0x120
> [ 39.031068] xfs_trans_free_extent+0x57/0x200
> [ 39.036512] xfs_extent_free_finish_item+0x26/0x40
> [ 39.042442] xfs_defer_finish+0x174/0x770
> [ 39.047494] xfs_itruncate_extents+0x126/0x470
> [ 39.053035] xfs_setattr_size+0x2a1/0x310
> [ 39.058093] xfs_vn_setattr_size+0x57/0x160
> [ 39.063342] xfs_vn_setattr+0x50/0x70
> [ 39.068015] notify_change+0x2ee/0x420
> [ 39.072785] do_truncate+0x5d/0x90
> [ 39.077165] path_openat+0xab2/0xc00
> [ 39.081737] do_filp_open+0x8a/0xf0
> [ 39.086213] do_sys_open+0x123/0x200
> [ 39.090787] SyS_open+0x1e/0x20
> [ 39.094876] entry_SYSCALL_64_fastpath+0x23/0xc2

And that's waiting for a metadata read IO to complete during a
truncate transaction. This has no direct connection to the inode at
all - it's an allocation group header that we have to lock to do
block allocation. The completion it is waiting on doesn't even run
through the same workqueue as the ioends - ioends go through
mp->m_data_workqueue or mp->m_unwritten_workqueue, metadata buffers
go through mp->m_buf_workqueue or mp->m_log_workqueue.
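
i.e. the two completions are dispatched like this (paraphrased, detail
elided - the point is just that they are different queues and different
work items):

/* data/unwritten ioend completion dispatch, fs/xfs/xfs_aops.c */
queue_work(mp->m_data_workqueue, &ioend->io_work);
	/* or mp->m_unwritten_workqueue for unwritten extent conversion */

/* metadata buffer completion dispatch, fs/xfs/xfs_buf.c */
queue_work(mp->m_buf_workqueue, &bp->b_ioend_work);
	/* or mp->m_log_workqueue for log buffers */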

So from that perspective, an ioend blocked on an inode lock should
not ever prevent metadata buffer completions from being run. Hence
I'd call this a false positive at best, though I think it really
indicates a bug in the new lockdep code because it isn't
discriminating between different workqueue contexts properly.

Even if I ignore the fact that buffer completions are run on
different workqueues, there seems to be a bigger problem with this
sort of completion checking.

That is, the trace looks plausible because we definitely hold an
inode locked deep inside a truncate operation at the point where
the completion is flagged. Indeed, some transactions that would
flag like this
could be holding up to 5 inodes locked and have tens of other
metadata objects locked. There are potentially tens (maybe even
hundreds) of different paths into this IO wait point, and all have
different combinations of objects locked when it triggers. So
there's massive scope for potential deadlocks....

.... and so we must have some way of avoiding this whole class of
problems that lockdep is unaware of.

Indeed, I don't think we can get stuck here because of the IO
locking rules we have specifically to avoid this sort of issue.
Having IO in flight and running completions during a truncate
operation on the same inode is a Really Bad Thing to be doing. It's
actually a corruption and/or data exposure vector, neither of which
makes users happy, and so filesystems tend to have designed-in
mechanisms for avoiding these problems.

In the case of XFS, it's the data IO serialisation model that
prevents such problems. That is, the inode locks we hold at this
point in the truncate process (the XFS_IOLOCK, a.k.a. i_rwsem)
prevent new IO from being issued, and we don't start the truncate
until we've waited for all in-progress IO to complete. Hence while
the truncate runs and blocks on metadata IO completions, no data IO
can be in progress on that inode, and so no data IO completions are
being run on that inode in workqueues.
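
In rough terms, the truncate side does this before it ever gets near
the transaction that waits on metadata buffer IO (a simplified sketch,
not the exact call chain):

/* xfs_vn_setattr/xfs_setattr_size path, heavily simplified */
xfs_ilock(ip, XFS_IOLOCK_EXCL);			/* i_rwsem: no new data IO */
inode_dio_wait(VFS_I(ip));			/* drain in-flight direct IO */
filemap_write_and_wait(VFS_I(ip)->i_mapping);	/* flush buffered IO and wait,
						   running its ioend completions */
/*
 * Only now do we allocate the truncate transaction and potentially
 * block on metadata buffer reads - there is no data IO, and hence no
 * data IO completion work, outstanding on this inode.
 */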

And therefore the IO completion deadlock path reported by lockdep
cannot actually be executed during a truncate, and so it's a false
positive.

Back to the drawing board, I guess....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx