Re: [PATCH] dax: fix deadlock in __dax_fault

From: Dave Chinner
Date: Wed Sep 23 2015 - 22:54:58 EST

In reply to: Ross Zwisler: "[PATCH] dax: fix deadlock in __dax_fault"

On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> Fix the deadlock exposed by xfstests generic/075. Here is the sequence
> that was causing us to deadlock:
>
> 1) enter __dax_fault()
> 2) page = find_get_page() gives us a page, so skip
> i_mmap_lock_write(mapping)
> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> passes, enter this block
> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> i_mmap_unlock_write(mapping);
> return dax_load_hole(mapping, page, vmf);
>
> This causes us to up_write() a semaphore that we weren't holding.
>
> The up_write() on a semaphore we didn't down_write() happens twice in
> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> Reported-by: Dave Chinner <david@xxxxxxxxxxxxx>
> ---
> fs/dax.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7ae6df7..df1b0ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> if (error)
> goto unlock;
> } else {
> - i_mmap_unlock_write(mapping);
> + if (!page)
> + i_mmap_unlock_write(mapping);
> return dax_load_hole(mapping, page, vmf);
> }
> }

I can't review this properly because I can't work out how this
locking is supposed to work. Captain, we have a Charlie Foxtrot
situation here:

	page = find_get_page(mapping, vmf->pgoff)
	if (page) {
		....
	} else {
		i_mmap_lock_write(mapping);
	}

So if there's no page in the page cache, we lock the i_mmap_lock.
Then we have the case the above patch fixes. Then later:

	if (vmf->cow_page) {
		.....
		if (!page) {
			/* can fall through */
		}
		return VM_FAULT_LOCKED;
	}

Which means __dax_fault() can also return here with the
i_mmap_lock_write() held. There's no documentation to indicate why
this is valid, and only by looking about 4 function calls higher up
the stack can I see that there's some attempt to handle this
*specific return condition* (in do_cow_fault()). That also is
lacking in documentation explaining the circumstances where we might
have the i_mmap_lock_write() held and have to release it. (Not to
mention the beautiful copy-n-waste of the unlock code, either.)
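
To make the "return value conditional" part concrete, here is a toy
userspace model, not kernel code. Every name in it (cow_lock,
cow_fault, cow_caller, COW_LOCKED) is made up for illustration, and
the counter only stands in for i_mmap_rwsem. It is a sketch of the
contract as I read it, assuming the callee keeps the lock on the COW
path when it found no page:

```c
#include <stdbool.h>

/* Toy stand-in for i_mmap_rwsem: counts write holds so imbalance shows. */
struct cow_lock {
	int write_holds;
};

static void cow_lock_write(struct cow_lock *l)   { l->write_holds++; }
static void cow_unlock_write(struct cow_lock *l) { l->write_holds--; }

enum cow_ret { COW_DONE, COW_LOCKED };

/*
 * Hypothetical callee modelled on the flow described above: the lock
 * is taken only when no page was found, and the COW path returns
 * early without dropping it.
 */
static enum cow_ret cow_fault(struct cow_lock *l, bool have_page, bool cow)
{
	if (!have_page)
		cow_lock_write(l);

	if (cow)
		return COW_LOCKED;	/* lock still held iff !have_page */

	if (!have_page)
		cow_unlock_write(l);
	return COW_DONE;
}

/*
 * Hypothetical caller: it has to know the callee's undocumented
 * contract and repeat the same "no page" test to release the lock.
 * Returns the number of write holds left over (0 means balanced).
 */
static int cow_caller(struct cow_lock *l, bool have_page)
{
	enum cow_ret ret = cow_fault(l, have_page, true);

	if (ret == COW_LOCKED && !have_page)
		cow_unlock_write(l);
	return l->write_holds;
}
```

The model only balances because the caller duplicates the callee's
condition. Any caller that does not know about it leaves the lock
held, which is exactly the kind of thing that needs to be written
down.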

The above code in __dax_fault() is then followed by this gem:

	/* Check we didn't race with a read fault installing a new page */
	if (!page && major)
		page = find_lock_page(mapping, vmf->pgoff);

	if (page) {
		/* mapping invalidation .... */
	}
	.....

	if (!page)
		i_mmap_unlock_write(mapping);

Which means that if we had a race with another read fault, we'll
remove the page from the page cache, insert the new direct mapped
pfn into the mapping, and *then fail to unlock the i_mmap lock*.

Is this supposed to work this way? Or is it another bug?
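
The race can be shown with another toy userspace model (again not
kernel code; race_lock, race_keyed_on_page and race_keyed_on_flag are
hypothetical names, and the "major" check is folded into a single
raced_in flag for brevity). The first function keys the unlock on
page, as the snippets above do. The second is one possible
alternative, assuming the function remembers for itself whether it
took the lock:

```c
#include <stdbool.h>

/* Toy stand-in for i_mmap_rwsem: counts write holds so a leak shows. */
struct race_lock {
	int write_holds;
};

static void race_lock_write(struct race_lock *l)   { l->write_holds++; }
static void race_unlock_write(struct race_lock *l) { l->write_holds--; }

/*
 * Unlock keyed on "page", which can change between the lock decision
 * and the unlock decision. Returns the leftover write holds.
 */
static int race_keyed_on_page(bool cached_at_entry, bool raced_in)
{
	struct race_lock l = { 0 };
	bool page = cached_at_entry;

	if (!page)
		race_lock_write(&l);

	/* Models: a racing read fault installed a page in the meantime. */
	if (!page && raced_in)
		page = true;

	if (!page)
		race_unlock_write(&l);

	return l.write_holds;
}

/* Same flow, but the unlock is keyed on an explicit "locked" flag. */
static int race_keyed_on_flag(bool cached_at_entry, bool raced_in)
{
	struct race_lock l = { 0 };
	bool page = cached_at_entry;
	bool locked = false;

	if (!page) {
		race_lock_write(&l);
		locked = true;
	}

	if (!page && raced_in)
		page = true;

	if (locked)
		race_unlock_write(&l);

	return l.write_holds;
}
```

In this model the first variant leaves one write hold behind when it
starts with no cached page and then loses the race, while the second
stays balanced in every combination. Whether an explicit flag is the
right answer for the real code is a separate question; the point is
only that "page" is not a reliable proxy for "we hold the lock".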

Another difficult question this locking change raises that I
can't answer: is it valid to call into the filesystem via get_block()
or complete_unwritten() while holding the i_mmap_rwsem? This puts
filesystem transactions and locks inside the scope of i_mmap_rwsem,
which may conflict with the inode lock order dependency we already
have w.r.t. i_mmap_rwsem through truncate (and probably other
paths, too).

So, please document the locking model, explain the corner cases and
the intricacies like why *unbalanced, return value conditional
locking* is necessary, and update the charts of lock order
dependencies in places like mm/filemap.c, and then we might have
some idea of how much of a train-wreck this actually is....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx