Re: [PATCH] dax: fix deadlock in __dax_fault
From: Dave Chinner
Date: Tue Sep 29 2015 - 21:58:04 EST
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> > > > 4) Test all changes with xfstests using both xfs & ext4, using lockep.
> > > >
> > > > Did I miss any issues, or does this path not solve one of them somehow?
> > > >
> > > > Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can
> > > > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> > >
> > > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > > the bad changes in 4.3, and then work towards fixing this for the
> > > 4.4 merge window. If someone needs this for 4.3, then they can
> > > backport the 4.4 code to 4.3-stable.
> > >
> > > The "fast and loose and fix it later" development model does not
> > > work for persistent storage algorithms; DAX is storage - not memory
> > > management - and so we need to treat it as such.
> > Okay. To get our locking back to v4.2 levels here are the two commits I think
> > we need to look at:
> > commit 843172978bb9 ("dax: fix race between simultaneous faults")
> > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
And just to indicate why 4.3 is completely unrealistic, let me give
you a summary of this patchset so far:
> f672ae4 xfs: add ->pfn_mkwrite support for DAX
I *think* it works.
> 6855c23 xfs: remove DAX complete_unwritten callback
> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks
DAX still needs to use IO completion callbacks for the DIO path, so
needed rewriting. Made 6855c23 redundant.
> 246c52a xfs: update size during allocation for DAX
Fundamentally broken, so removed. DIO passes the actual size from IO
completion, not into block allocation, hence DIO still needs
completion callbacks. DAX page faults can't change
the file size (should segv before we get here), so need to
specifically handle that to avoid leaking ioend structures due to
incorrect detection of EOF updates due to ovreflows...
> 9d10e7b xfs: Don't use unwritten extents for DAX
Exposed a behaviour in DIO and DAX that results in s64 variable
overflow when writing to the block at file offset (2^63 - 1FSB).
Both the DAX and DIO code ask for a mapping at:
xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
which gives a size of 0x8000000000000000 (larger than
sb->s_maxbytes!) and results a sign overflow checking if a inode
size update is requireed. Direct IO avoids this overflow because
the logic checks for unwritten extents first and the IO completion
callback that has the correct size. Removing unwritten extent
allocation from DAX exposed this bug through firing asserts all
through the XFS block mapping and IO completion callbacks....
Fixed the overflow, testing got further and then fsx exposed another
problem similar to the size update issue above. Patch is
fundamentally broken: block zeroing needs to be driven all the way
into the low level allocator implementation to fix the problems fsx
> eaef807 xfs: factor out sector mapping.
Probably not going to be used now.
So, basically, I've rewritten most of the patch set once, and I'm
about to fundamentally change it again to address problems the first
two versions have exposed. Hopefully this will show you the
complexity of what we are dealing with here, and why I said this
needs to go through 4.4?
It should also help explain why I suggested that if ext4 developers
aren't interested in fixing DAX problems then we should just drop
ext4 DAX support? Making this stuff work correctly requires more
than just a cursory knowledge of a filesystem, and nobody actively
working on DAX has the qualifications to make these sorts of changes
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/